We (i.e., the Ball State University Libraries) are in the middle of a sheet music digitization project. We’ve already finished the easy part by scanning all of the music. All of the music is from a collection in the archives which has only been cataloged at the collection level, so we are left with the much more difficult step of cataloging every piece of music before we can add it to our Digital Media Repository (DMR).
Word from on high is that we will catalog everything in MARC/AACR2 and then write some sort of script (my job) to transform the data from that to our Dublin Core-based element set. As one might imagine, my heart sank upon hearing of this mandate. So much information in the MARC record is entered as free text, making it nigh impossible to get some information out without manual intervention.
Friday morning I had a sit-down with Sue, our music cataloger who will be heading up this phase of the project. We went down the list of fields we want in the DMR and identified where we might find the data in the MARC records. Since we did this before the cataloging started, we were able to identify the ambiguous or free-text MARC fields and agree on standards for explicitly coding these fields (e.g., using relator codes in the 100/700 fields) or using structured (but still natural-sounding) text to identify the relevant data. That should enable me to write a script that grabs everything I need (and ignores everything I don’t need).
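To give a rough idea of what the relator-code convention buys us, here is a minimal Python sketch, not the actual script. Everything in it is an assumption for illustration: the record is a simplified list of (tag, subfields) pairs rather than real MARC, the function name `names_by_element` is made up, the mapping of relator codes to "creator" and "contributor" is a guess at what our element set might want, and the second name in the sample record is a placeholder.

```python
# Hypothetical mapping from MARC relator codes (subfield $4) to DC-style
# element names. The codes are real relator codes; the mapping is an assumption.
RELATOR_TO_ELEMENT = {
    "cmp": "creator",      # composer
    "lyr": "creator",      # lyricist
    "arr": "contributor",  # arranger
}


def names_by_element(fields):
    """Group names from 100/700 fields by target element.

    `fields` is a simplified stand-in for a MARC record: a list of
    (tag, subfields) pairs, where subfields maps a subfield code to a value.
    Real records allow repeated subfields; this sketch ignores that.
    """
    result = {}
    for tag, subfields in fields:
        if tag not in ("100", "700"):
            continue  # ignore everything that is not a personal name field
        # Drop the trailing ISBD-style comma or space left before the next subfield.
        name = subfields.get("a", "").rstrip(" ,")
        element = RELATOR_TO_ELEMENT.get(subfields.get("4"))
        if name and element:
            result.setdefault(element, []).append(name)
    return result


# Sample data for illustration only; "Doe, Jane" is a placeholder name.
record = [
    ("100", {"a": "Berlin, Irving,", "4": "cmp"}),
    ("245", {"a": "Alexander's ragtime band"}),
    ("700", {"a": "Doe, Jane,", "4": "arr"}),
]

print(names_by_element(record))
```

The point is that a name field without a relator code gets skipped rather than guessed at, which is exactly why agreeing on the coding before the cataloging starts matters.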
Our biggest remaining hurdle is the subject headings. The MARC tags are just too ambiguous and LCSH too inconsistent for us to get meaningful data out of them. There is often no way to know whether a heading indicates what the resource is or what it is about, especially when one starts mixing in the geographical headings and subfields. The problem may not be 100% solvable, but I would probably be content with 80-90% solvable if that is what we can reasonably do.
While it is great that Sue and I could get together to establish some workable standards before the cataloging starts, our solutions only go so far. It is quite possible that someone will want to pull other data out of the MARC record at some point but will not be able to because, although the information is in the record, it is not in a structured format that can be easily extracted. This, of course, takes us off on a much larger rant about MARC, AACR2, and other cataloging standards that I won’t get into right now.