The Digital Mortgage: Digital Preservation of Oral History
by Doug Boyd
From the moment an interviewer presses the record button on an audio or video recorder, the interviewer becomes the curator and caretaker of a precious, fragile, and unique item. Ideally, at the moment of creation, the digital file begins its journey from the interview context to a stable archival repository ready to ingest it into a sophisticated digital preservation system. However, many interviews are created before the person responsible for the oral history has made any arrangements to preserve the recording or provide access to it. That is not a good practice, let alone a best practice; avoid it if you can. If you read this paper and follow its advice, your materials are more likely to survive long enough to reach a durable archive, to survive longer once there, and to be accessed more easily. In a digital context, from the moment of creation you are also preserving the interview, and of course you will need to access it. Whatever “digital asset” you are creating must be curated with long-term sustainability as a major priority.
CORE CONCEPTS of DIGITAL PRESERVATION
You, as the curator of this digital object, are trying to stack the deck in your favor, reducing risks to the long-term preservation and access of your materials. Assume that whatever computer creates a digital file will eventually break down. The hard disk drive (HDD) that initially houses the interview will eventually crash and need replacement. These calamities were less likely with audiotape and tape recorders, which often lasted twenty years or more. Oral history in the digital age requires more planned, vigilant approaches to access and preservation. There are tactics and techniques for mitigating risk that work in a wide variety of institutional contexts and budgets. This article offers an introduction to these risks and tactics.
One good way is to partner with an established archive and preservation institution that lives and breathes these issues, like the Louie B. Nunn Center at the University of Kentucky! A second is to dive in and create an archive and preservation system of your own. I highly recommend the first. Either way, whether you partner with professionals or go it alone, you should know the following.
Redundancy and Distribution
Redundancy refers to the creation of multiple copies of the interview. Digital technologies give us the ability to create exact duplicates of a digital object. Redundancy is one of the simplest aspects of digital preservation for an individual or a small institution to execute. Although merely making backups does not constitute a digital preservation plan, maintaining multiple instantiations (copies) of a digital file is a key component of a responsible one. The core concept behind LOCKSS (Lots of Copies Keep Stuff Safe) is an effective core strategy for digital preservation. Maintaining multiple copies, or instantiations, of an interview is vital for managing risk and ensuring the digital continuity of your interview. Redundancy could mean, quite simply, maintaining a local version of an interview and simultaneously maintaining a backup version. Personally, I prefer preservation strategies that retain three instantiations of an interview, living in at least two different places.
What should you store your files on? Optical media, like CDs and DVDs, have fallen out of favor. Amazingly, the little 1s and 0s that make up your files on a CD or DVD sometimes disappear or change (see Fixity, a bit later in the chapter). In the past few years, more degradation and instability than originally projected have made these options less desirable. If an interview is backed up on a CD-R or a DVD-R, assume that you will soon need to extract the digital files from that medium and migrate them to something more stable (consider doing it now). Like what?
Right now, the consensus preference for preservation-oriented storage media is hard disk drives (HDDs). Whether you are storing your backup on HDDs or on digital tape such as LTO (Linear Tape-Open), assume that this medium will also be short-lived (though not as short-lived as optical media) and that you will need to refresh the HDDs or digital tapes regularly. How regularly? Well, if you are doing it yourself, it depends on how long the HDDs or LTO tapes are likely to stay 100% healthy (ask!), how many copies you have, whether the copies are in different places, and how devastated you will be if something is lost forever.
If you use external hard drives, use RAID-ready HDDs. RAID stands for “Redundant Array of Inexpensive Disks,” which means that data is distributed across multiple drives that live in a rack. Below is a picture of one hard drive, then a picture of RAID storage: lots of slide-in, slide-out HDDs in a rack. And, of course, RAIDs include a strategy for redundancy.
One scheme that I prefer for its simplicity is RAID 1, or “mirroring.” In this approach, each file is stored twice, on different drives. An example would be a 2-terabyte external hard disk drive that actually contains 4 terabytes of storage, simultaneously writing replica data to both drives. If one drive fails, the other contains an exact copy of the data. Of course, you will need software that knows to do this; we’ll get to that later.
Still worried? If it will help you sleep better, increase the distribution of your interviews to three disks. Then be sure one copy is in a different building, away from a flood zone. Distribution can mean anything from storing data on two separate external HDDs and physically separating them between buildings, campuses, or states to… (fill in the blank). Network backup protocols typically utilize “off-site” backups of servers, and the cloud (a big set of servers that grows as demand grows) is becoming a popular and fairly affordable option.
Where can you find an archive if you don’t want to build one yourself or partner with the Louie B. Nunn Center? Here are some professional options. And, as when visiting a doctor, be sure to ask questions about how they keep your data preserved:
1. Regionally distributed digital preservation initiatives, including the MetaArchive Cooperative (http://www.metaarchive.org/) and the Alabama Digital Preservation Network (http://www.adpn.org), which send digital objects over a LOCKSS-based network that distributes six versions of each digital object across six regional locations. These are considered higher-end preservation solutions but serve as model projects demonstrating the concepts of redundancy and distribution. Why use them? You really, really, really want to stack the odds against losing your materials. Of course, preserving them won’t mean much if you can’t access them, but we will get to that later.
2. Cloud solutions are becoming increasingly accessible for both individuals and institutions. Right now, cloud-based solutions targeted at households or individuals usually require a monthly fee and back up a single computer online. Services such as www.carbonite.com or www.backblaze.com are good examples. The only caution with household or individual solutions is the risk that the company folds; some of these providers have come and gone. Once again, this should be only one aspect of your strategy. Cloud-based solutions for larger institutions, such as OCLC’s Digital Archive, are far more expensive; however, they provide the full range of digital preservation tools expected in a large-scale environment.
I often use a Starbucks analogy when thinking about digital preservation strategy. The following are my “Tall,” “Grande,” and “Venti” preservation strategies. “Tall” represents small budgets with very little institutional technical support and infrastructure. “Grande” represents a medium-sized institution/budget with moderate support/infrastructure. “Venti” represents a sophisticated digital preservation system usually associated with large institutions, major budgets, and major infrastructure, typically at a major university or research library. I have graphically represented redundancy- and distribution-related strategies that work effectively with varying budgets. The following graphics were originally created for digital audio but work equally well for digital video.
Fixity
A key goal of digital preservation is continuity: making sure every bit, each 1 or 0, stays a 1 or 0. Continuity depends on many variables, and fixity, the assurance of ongoing data file integrity, is essential. Unfortunately, digital data corrupts. Bits will rot even under some of the most ideal conditions, and once the bits start to go, you have lost some of your file. Data corruption is personally very frustrating. From an archival perspective, it can be devastating. Most of the time we do not know whether a data file’s integrity is intact until we try to open the file and get a warning message to the contrary. It will ruin your day, I promise you.
Depending on an automated backup is risky. Why? If a file has suffered some rotten bits and is then backed up, the corrupted version of the file is what gets duplicated. In this scenario, your unique, one-of-a-kind digital oral history interview is lost. One way archival repositories reduce the risk of this happening is by creating an initial reference point with a simple measurement called a checksum.
Put simply, a checksum is a measurement of data fixity and a confirmation of validity. Here’s how it works: take a checksum measurement early in a digital file’s lifespan; store it (do you need a checksum of the checksum?); then use that measurement to verify, on a regular basis, that the data is still intact. If the checksum is the same each time it is calculated, the file is very likely the same. There are several different algorithms for creating a checksum. The ones we use at the Nunn Center are the MD5 and SHA checksums. Check them out. There are several free and inexpensive tools available to create a checksum. For Macintosh computers, I prefer a program called MD5 or File Hasher. If you use a PC, I recommend FastSum. All of these are under $20.
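The measurement itself is simple to produce. Below is a minimal sketch in Python using the standard hashlib library, which implements both the MD5 and SHA algorithms mentioned above; the interview file name is a hypothetical example.

```python
import hashlib

def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Compute a checksum for a file, reading it in chunks so that
    large audio/video files never have to fit in memory at once."""
    digest = hashlib.new(algorithm)  # e.g. "md5", "sha1", "sha256"
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical interview file; record the returned value in your
# collection management system at accession time.
# file_checksum("nunn_center_interview_001.wav")
```

Compute the value at accession, store it alongside the file’s record, and any later run that returns a different string tells you the bits have changed.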
The checksum should be captured as early as possible. The Nunn Center has implemented a workflow that captures the checksum before the digital file is transferred off the original media being accessioned into the archive. If the audio or video file is turned in to the archive on flash media, the checksum is captured from the original media, prior to transfer. That checksum is then ingested into our collection management database, classified as technical and/or preservation metadata, and serves as the reference point for all future checks of file integrity.
Neither you nor an archivist wants to run checksums on individual files every day. When approaching an oral history collection on a modest or large scale, automation is the key. It dramatically increases efficiency if the digital object being evaluated (checksummed) is safely housed in a networked environment, where checksum comparison can be automated on a mass scale. On a smaller scale, outside of a networked environment, this process is difficult to automate. For example, if interviews are housed on external hard drives, those hard drives will have to be taken out of storage and powered up before a checksum can be verified. This can be labor intensive. At the very least, verifying checksums after migration (moving data from one place to another) is paramount: it confirms that the data were transferred and that the original file integrity is intact.
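In a networked environment, that comparison is easy to automate. The sketch below assumes the checksums were stored in a plain-text manifest of `<md5>  <filename>` lines (the format the common md5sum utility produces); the manifest layout is an assumption for illustration, not the Nunn Center’s actual system.

```python
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Chunked MD5 so large media files are never read whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path):
    """Re-hash every file listed in the manifest and report any
    mismatch; a mismatch means the stored copy has changed."""
    root = Path(manifest_path).parent
    failures = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        if md5_of(root / name.strip()) != expected:
            failures.append(name.strip())
    return failures  # an empty list means every file is intact
```

Run on a schedule against a networked store, a script like this turns fixity monitoring from a chore into a report.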
Migration to Different Storage Media
With the audiocassette, we knew that we could put a label on the tape, put the tape in the case and the case on a shelf, and, assuming you could find a working tape recorder, the tape would still be playable twenty years later. This is due, in part, to the ubiquity and market dominance of the audiocassette for so long. Migration, or moving the audio off the tape to a different kind of storage, was not such a big issue.
While digital data is more flexible and accessible, its storage is not as durable and robust. And, to complicate matters further, standards and dominant formats are quickly changing. On the storage side, initial projections estimated CD-R and DVD-R lifespans in decades; this has been quickly adjusted to under a decade. Hard drives and servers need to be refreshed every few years. The assumption of regular, future migration needs to be built into any digital preservation plan.
On the standards side, migration also refers to the fact that today’s formats, like .wav or .mov files, will need to be migrated to new standards as standards evolve and the commercial markets inevitably abandon the ubiquitous formats, and players, of today.
Broadly, migration speaks directly to the notion that preservation systems will evolve. Again, you need to build architecture and infrastructure with the implicit assumption that migration will be regular; thinking otherwise will only create expensive logistical barriers that will need to be overcome in the future.
Migration happens constantly, in small and big ways. You will need to transfer content from the original capture media to external HDDs or to a networked server. That is migration. When the server backs up to tape or to the cloud, you are transferring the content. That is migration. Transfer of content via networks can be prone to failure due to recognized or unrecognized human, physical, or network error. Unique data should not simply be copied and pasted from one digital location to another. In an archival context, data should be transferred in a responsible and reliable fashion.
Whenever transferring content from location to location, I recommend command-line tools such as Rsync for Macs or Robocopy for Windows. For the non-programmers out there, I know what you are thinking. “Command-line tools?” Yes, command-line tools. Rsync and Robocopy, built into the Mac and Windows operating systems respectively, provide powerful bit-level migration tools. Although they take more time, the assurance of bit-level copying and verification that a migration has succeeded is profoundly important. Simply copying and pasting, with a file then appearing in the new directory, is not an assurance that a file migration was successful. We use command-line tools to handle file transfer at the accession and processing phases of our archival workflow because I do not take chances with my oral histories. The final transmission of the file will utilize (drumroll, please) the Library of Congress’s BagIt protocol for deposit of our data package into the University of Kentucky’s preservation repository. For more information on BagIt, go to http://www.digitalpreservation.gov. Remember, simply copying and pasting files, which most of us take for granted as suitable, is not suitable for migration. To do migration well, you need to channel your most meticulous, detail-oriented self. Migration must be carefully monitored and verified to ensure digital continuity. I recommend that you map your workflow: draw out your policy about which files go where, and articulate when surrogates need to be created (see the case study Born Digital Accession Workflow: The Louie B. Nunn Center for Oral History, University of Kentucky Libraries by Doug Boyd and Sara Abdmishani Price).
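For those who want to script their workflow, the copy-then-verify discipline described above can be sketched in a few lines of Python. The paths are hypothetical, and this is an illustration of the principle rather than a replacement for Rsync, Robocopy, or BagIt.

```python
import hashlib
import shutil
from pathlib import Path

def _md5(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def migrate(src, dst_dir):
    """Copy a file and refuse to declare the migration done until
    the destination's checksum matches the source's."""
    src = Path(src)
    dst = Path(dst_dir) / src.name
    before = _md5(src)
    shutil.copy2(src, dst)  # copy2 also preserves file timestamps
    if _md5(dst) != before:
        raise IOError(f"Migration of {src.name} failed verification")
    return dst
```

The point is the final comparison: the migration is not done because a file appeared in the destination, but because the destination’s checksum matches the source’s.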
Format and system interoperability is another critical aspect of digital continuity. Format obsolescence is a very real phenomenon in a digital multimedia environment. It is important that interviewers choose recording formats that are open and/or ubiquitous. Proprietary formats will eventually change and may be abandoned by the vendors that created them when newer, better formats evolve. Although digital preservation philosophy generally advises against proprietary formats, there is some level of safety in ubiquity. We learned many lessons in the early days of digital fieldwork with MiniDisc and Digital Audio Tape: both formats had highly attractive attributes and great popularity among fieldworkers, and both were quickly abandoned when data-file-based recording systems became more affordable and when both proved difficult from a preservation perspective.

Choose an audio or video recorder/camera that records in a format your archival and preservation system can handle, but choose a format that is open or ubiquitous. For audio, the WAVE (.wav) format has become the ubiquitous standard for recording uncompressed audio. The modified Broadcast WAVE format (.bwf) has become a standard for preservation, mainly because of its capacity for embedding metadata into the file wrapper itself. It has not achieved ubiquitous adoption, and many systems simply ignore the embedded metadata; however, the general consensus is that .bwf will eventually become the standard. Most field recorders do not record in the .bwf format, so the Nunn Center accessions born-digital formats as they were originally recorded; in most cases this is a .wav file for audio. Video formats are far more complex. Proprietary cameras use proprietary codecs to achieve the highest recording quality in the smallest file size.
See Dean Rehberger and Scott Pennington’s essay Video Equipment: Guide to Selecting and Use, Kara Van Malssen’s essay Digital Video Preservation and Oral History, and Doug Boyd’s case study Is Perfect the Enemy of Good Enough? Digital Video Preservation in the Age of Declining Budgets.
Interoperability is not just about format obsolescence; it is also about the interoperability of metadata and systems. Many digital projects in the early 1990s were created in proprietary systems that eventually became obsolete. Initially, these digital projects had great impact; fifteen years later, the same projects require major grant support just to make those systems work again. Current paradigms build in a more interoperable approach to creating and storing data and metadata. Consider what metadata schema you are using. Consider what system you are using to store your metadata. Do these systems work well with other systems? Do they allow future migration? If not, thousands of dollars will be spent down the road trying to move your content and digital objects; reconsider.
Overwhelm the Future with Metadata
Information is power, and good technical and preservation metadata collected in the present will transform future archivists’ abilities to effectively curate your digital assets. Two critical aspects of future digital continuity are technical and preservation metadata. The bad news is that this involves a massive quantity of elements that would take hours to enter individually into a collection management system. The good news is that most of these elements can be automatically harvested using free tools such as MediaInfo. The following is an example of a MediaInfo export of technical metadata for a video file, a .mov that was given to the Nunn Center by a video-editing studio. The .mov file is a wrapper containing a variety of variables, including audio and video codecs, audio and video resolutions, audio and video bitrates, frame rates, color sampling, etc. Each of these elements is critical for making the video file work in the future. While not everyone may understand what role each element plays in the technical process, it is key to document this metadata in an archival collection management system so that future obsolescence can be monitored.
General
- Complete name : /C2KY ProRes Master.mov
- Format : MPEG-4
- Format profile : Base Media / Version 2
- Codec ID : mp42
- File size : 1.40 GiB
- Duration : 13mn 16s
- Overall bit rate mode : Variable
- Overall bit rate : 15.1 Mbps
- Encoded date : UTC 2011-07-06 16:47:17
- Tagged date : UTC 2011-07-06 16:47:17
Video
- ID : 1
- Format : AVC
- Format/Info : Advanced Video Codec
- Format profile : Main@L4.2
- Format settings, CABAC : Yes
- Format settings, ReFrames : 3 frames
- Codec ID : avc1
- Codec ID/Info : Advanced Video Coding
- Duration : 13mn 16s
- Bit rate : 15.0 Mbps
- Width : 1 920 pixels
- Height : 1 080 pixels
- Display aspect ratio : 16:9
- Frame rate mode : Constant
- Frame rate : 29.970 fps
- Standard : NTSC
- Color space : YUV
- Chroma subsampling : 4:2:0
- Bit depth : 8 bits
- Scan type : Progressive
- Bits/(Pixel*Frame) : 0.241
- Stream size : 1.39 GiB (99%)
- Language : English
- Encoded date : UTC 2011-07-06 16:47:17
- Tagged date : UTC 2011-07-06 16:47:17
Audio
- ID : 2
- Format : AAC
- Format/Info : Advanced Audio Codec
- Format profile : LC
- Codec ID : 40
- Duration : 13mn 16s
- Bit rate mode : Variable
- Bit rate : 157 Kbps
- Maximum bit rate : 237 Kbps
- Channel(s) : 2 channels
- Channel positions : Front: L R
- Sampling rate : 48.0 KHz
- Compression mode : Lossy
- Stream size : 14.9 MiB (1%)
- Language : English
- Encoded date : UTC 2011-07-06 16:47:17
- Tagged date : UTC 2011-07-06 16:47:17
- Material_Duration : 796629
- Material_StreamSize : 15671193
The Nunn Center’s collection management system, SPOKEdb, automatically harvests MediaInfo exports, parses out the key elements, and places them in the appropriate database fields. This allows the archivist to create a technical report on, for example, all of the interviews in the archival collection that were encoded using the H.264 video codec. When the H.264 codec becomes obsolete, the future archivist will have the information needed for migration, because the necessary technical metadata has been saved. The following is a screenshot of the technical metadata section of SPOKEdb, the Nunn Center’s collection management system. Most of the fields are automatically populated when we paste in the export from MediaInfo.
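Harvesting of this kind is straightforward to sketch. The following is a hypothetical illustration, not SPOKEdb’s actual code: it reads a MediaInfo text export like the one above and turns it into field/value pairs that could be mapped onto database columns.

```python
def parse_mediainfo(text):
    """Turn a MediaInfo text export into (field, value) pairs.
    Fields repeat across streams (e.g. 'Format' appears for both
    video and audio), so pairs are kept in order, not collapsed."""
    pairs = []
    for line in text.splitlines():
        line = line.lstrip("- ").strip()
        if " : " not in line:
            continue  # skips blank lines and stream section headers
        field, value = line.split(" : ", 1)
        pairs.append((field.strip(), value.strip()))
    return pairs

def first(pairs, field):
    """Convenience lookup for the first occurrence of a field."""
    return next((v for f, v in pairs if f == field), None)
```

For the export above, `first(pairs, "Overall bit rate")` would return "15.1 Mbps", and collecting every "Format" value yields one entry per stream.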
There are many metadata schemas available for the archivist to choose from (See Elinor Maze’s essay on metadata standards). At this time, I particularly like the PBCore 2.0 (http://pbcore.org/). This standard was developed particularly for audiovisual material and excels at documenting the specifics of audio and visual technical metadata. The 2.0 revision of PBCore introduces many new innovations that make it particularly effective for documenting born digital content. This may not be the right schema for everyone. If your repository uses Dublin Core, try to maintain technical metadata for your audiovisual materials in a searchable database field in your archival management system.
Higher-end Open Archival Information Systems (OAIS), such as the one being implemented at the University of Kentucky, require staffing and great financial investment. The Trusted Digital Repository (TDR) standard is an exciting development; however, it too requires resources that are not within reach of the typical small institution or individual. At the University of Kentucky, the preservation repository utilizes the METS standard, which was developed by the Digital Library Federation and excels at interoperability. METS is a wrapper that allows multiple metadata schemas to be wrapped into the same package. For example, the Nunn Center utilizes PBCore 2.0 for descriptive and technical metadata and PREMIS for its preservation metadata. These individual components are wrapped up in a METS file that is used to ingest the package into the preservation system.
So, while the ideal preservation context is becoming more and more accessible to institutions with budgets and staff, what is someone with budget and technical limitations supposed to do about preservation-oriented metadata?
- Born-digital interview masters can often reside in parts spanning multiple files. Develop a workflow that tracks technical metadata for each of these files.
- Be meticulous and consistent with file naming. This is critical to automation.
- Track technical metadata for each instantiation created. Strive for that technical metadata to reside in a searchable field in your archival management system.
- Harvest the checksum for each file accessioned and incorporate that checksum into your archival management system.
- Choose a metadata schema that works with your archival repository. Working outside of your institution’s dominant system can be counterproductive. Explore customizations of that schema that are particularly useful for oral history.
- Do not hesitate to ask a well-established archival repository about their metadata schema and strategy. There is no need to reinvent the wheel.
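Several of the bullets above (consistent naming, per-file checksums, searchable records) can be combined into one small accession pass. The sketch below is illustrative only: the naming convention and the CSV log standing in for an archival management system are assumptions, not Nunn Center practice.

```python
import csv
import hashlib
import re
from pathlib import Path

# Hypothetical naming convention: collectionID_interviewNumber_part.ext
NAME_PATTERN = re.compile(r"^[a-z0-9]+_\d{3}_\d{2}\.(wav|mov)$")

def accession(folder, log_path):
    """Check each file name against the convention and log an MD5
    checksum per file; the CSV stands in for a real system."""
    rows, problems = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        if not NAME_PATTERN.match(path.name):
            problems.append(path.name)
            continue
        # Whole-file read is fine for a sketch; chunk reads for
        # multi-gigabyte video masters.
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        rows.append({"file": path.name, "md5": digest})
    with open(log_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "md5"])
        writer.writeheader()
        writer.writerows(rows)
    return problems  # names that violated the convention
```

Running it over an accession folder flags any file that breaks the naming convention and records a checksum for every file that conforms.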
“Archiving” is no longer the final step in an oral history preservation workflow; it needs to be the first step. Interviewers designing a project need to choose a recording format that will be sustainable. Interviewers need to partner with an archive as early in the process as possible. Choose an archive that has the capability to curate the type of collection you are generating. Ask the administrators of that archive to articulate their preservation plan to you. If they do not have a digital preservation plan, you should place your interviews elsewhere.
The following are some basic principles to keep in mind when designing a digital preservation system at any budget level.
- Maintain the original instance: The original files containing the recorded oral history interview should be curated and maintained.
- Preserve with Interoperability in mind: Some formats are highly proprietary, which is risky from a future-focused perspective. Utilize open formats and open protocols as much as possible. At times, however, use of proprietary formats is unavoidable; when this is the case, ensure that the format being used is at least a ubiquitous one.
- Understand your Formats: Digital audio and video formats are made up of multiple elements. A video “file” such as an .avi or a .mov is merely a container holding elements such as audio streams and a video stream, each carrying its own settings. Understanding these elements is key to future playability and compatibility.
- Monitor data file integrity (Fixity): Data corrupts. Simply storing a data file is not enough. Curators must ensure that the data being stored is being monitored for file corruption. An interview is of no use if a file is no longer playable. Backups are of no use if you have unknowingly backed up a corrupt file.
- Redundancy and Distribution (lots of copies in lots of places): There should be multiple copies of your interview, stored preferably in multiple locations.
- Pay attention to best practices: Many different entities have a vested interest in digital preservation of audio and video materials. We are not the only entities struggling with digital preservation. Pay attention at the national level.
- Avoid unnecessary re-encoding or transcoding of digital masters: Understand the consequences of compression for digital media files. The convenience of small file sizes can greatly impact future use.
- Monitor obsolescence: We live in a quickly changing digital environment. One of the greatest dangers of digital media technologies is obsolescence. When the commercial markets decide a format or technology is no longer viable, the disappearance of playback and retrieval abilities is swift. Take the minidisc, for example.
- Plan for future migration: Assume that you will need to migrate your digital collection to new formats in accordance with the best practices of the day.
- Overwhelm the future with metadata: Knowledge is power. Make sure that oral history collections are well documented with administrative, technical, and descriptive metadata to empower future archivists to handle your digital assets.
- Partnership: Partner with an archival institution that has the most current capabilities in digital preservation. If you use a vendor, confirm that the system you adopt is in line with standards and is implemented utilizing vendor-independent standards. Have an exit plan: can you take your data with you if you leave or if the vendor goes out of business?
Online Resources on Digital Preservation:
The Library of Congress Digital Preservation Resources: http://www.digitalpreservation.gov/about/resources.html
Sound Directions Project: Digital Preservation and Access for Global Audio Heritage
Metadata Encoding and Transmission Standard (METS)
Northeast Document Conservation Center (NEDCC) Planning for Digital Preservation: A Self Assessment Tool: http://nedcc.org/resources/digital/downloads/DigitalPreservationSelfAssessmentfinal.pdf
The NINCH Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Materials. Humanities Advanced Technology and Information Institute (HATII), University of Glasgow, and the National Initiative for a Networked Cultural Heritage (NINCH), 2002.
Digital Preservation Readiness Webliography
Tools for Digital Preservation
MediaInfo (technical metadata harvester): http://mediainfo.sourceforge.net/en
MD5 (Checksum): http://www.eternalstorms.at/md5/
Vendors and Services
OCLC Digital Archive: http://oclc.org/digitalarchive
The MetaArchive Cooperative: http://www.metaarchive.org/
For other vendors and services, just use your favorite search engine.
Citation for Article
Boyd, D. A. (2012). The digital mortgage: digital preservation of oral history. In D. Boyd, S. Cohen, B. Rakerd, & D. Rehberger (Eds.), Oral history in the digital age. Institute of Museum and Library Services. Retrieved from ohda.matrix.msu.edu/2012/06/the-digital-mortgage/.
Boyd, Douglas A. “The Digital Mortgage: Digital Preservation of Oral History,” in Oral History in the Digital Age, edited by Doug Boyd, Steve Cohen, Brad Rakerd, and Dean Rehberger. Washington, D.C.: Institute of Museum and Library Services, 2012, ohda.matrix.msu.edu/2012/06/the-digital-mortgage/
This is a production of the Oral History in the Digital Age Project (http://ohda.matrix.msu.edu) sponsored by the Institute of Museum and Library Services (IMLS). Please consult http://ohda.matrix.msu.edu/about/rights/ for information on rights, licensing, and citation.