Metadata - x + 3

Rigid Cataloging = Flexible Data

We (i.e., the Ball State University Libraries) are in the middle of a sheet music digitization project. We’ve already finished the easy part by scanning all of the music. All of the music is from a collection in the archives which has only been cataloged at the collection level, so we are left with the much more difficult step of cataloging every piece of music before we can add it to our Digital Media Repository (DMR).

Word from on high is that we will catalog everything in MARC/AACR2 and then write some sort of script (my job) to transform the data from that to our Dublin Core-based element set. As one might imagine, my heart sank upon hearing of this mandate. So much information in the MARC record is entered as free text, making it nigh impossible to get some information out without manual intervention.

Friday morning I had a sit-down with Sue, our music cataloger who will be heading up this phase of the project. We went down the list of fields we want in the DMR and identified where we might find the data in the MARC records. Since we did this before the cataloging started, we were able to identify the ambiguous or free text MARC fields and agree on standards for explicitly coding these fields (e.g., using relator codes in the 100/700 fields) or using structured (but still natural-sounding) text to identify the relevant data. That should enable me to write a script that grabs everything I need (and ignores everything I don’t need).
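To give a feel for why the relator codes matter, here is a minimal Python sketch of the kind of extraction they make possible. Everything in it is an assumption made for illustration: the simplified record layout, the function name names_by_role, the sample names, and the mapping from relator codes to element names are hypothetical, not our actual DMR element set or the script I will end up writing. It also assumes the relator code sits in subfield $4 of the 100/700 fields.

```python
# Hypothetical, simplified record: each field is a (tag, subfields) pair,
# where subfields maps a subfield code to its value. A real script would
# read these from MARC files with a MARC library instead.
RECORD = [
    ("100", {"a": "Doe, Jane", "4": "cmp"}),
    ("245", {"a": "An example waltz"}),
    ("700", {"a": "Roe, Richard", "4": "lyr"}),
    ("700", {"a": "Poe, Pat"}),  # no relator code: role unknown
]

# Assumed mapping from MARC relator codes to target element names.
ROLE_TO_ELEMENT = {
    "cmp": "composer",
    "lyr": "lyricist",
    "arr": "arranger",
}


def names_by_role(record):
    """Group names from 100/700 fields by the element their relator code maps to."""
    result = {}
    for tag, subfields in record:
        if tag not in ("100", "700"):
            continue  # ignore everything we don't need
        name = subfields.get("a")
        if name is None:
            continue
        # Names without a recognized relator code fall back to a generic element.
        element = ROLE_TO_ELEMENT.get(subfields.get("4"), "contributor")
        result.setdefault(element, []).append(name)
    return result


if __name__ == "__main__":
    print(names_by_role(RECORD))
```

The point of the sketch is the fallback: a name with an explicit code lands in a specific element, while an uncoded name can only be dumped into a generic one, which is exactly the manual-intervention problem we are trying to avoid by agreeing on the coding up front.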

The biggest hurdle we still have to tackle is the subject headings. The MARC tags are just too ambiguous and LCSH too inconsistent for us to get meaningful data out of them. There is often no way to know if the heading indicates what the resource is or what it is about, especially when one starts mixing in the geographical headings and subfields. The problem may not be 100% solvable, but I would probably be content with 80-90% solvable if that is what we can reasonably do.

While it is great that Sue and I could get together to establish some workable standards before the cataloging starts, our solutions only go so far. It is quite possible that someone will want to pull other data out of the MARC record at some point but will not be able to because, although the information is in the record, it is not in a structured format that can be easily extracted. This, of course, takes us off on a much larger rant about MARC, AACR2, and other cataloging standards that I won’t get into right now.

Dereferencing URIs

It’s a topic that has come up countless times in discussions of the Semantic Web (e.g.), and it came up recently on #code4lib: should all URIs be dereferenceable, or is it worthwhile to use non-HTTP URI schemes or non-resolving HTTP URIs?

The consensus from Semantic Web developers seems to be that URIs need not be dereferenceable, which makes a certain amount of sense. If you give me the URI “http://jonathan.brinley.name/”, what would you put at the location “http://jonathan.brinley.name/”? If it’s a description of me, that description also has the URI “http://jonathan.brinley.name/”, giving us two resources with the same URI. With this data now in our system, we can make absurd statements like:
<http://jonathan.brinley.name/> <#describes> <http://jonathan.brinley.name/> .

This is all very ambiguous, since it could be saying:

I’m describing myself
I’m describing the document at “http://jonathan.brinley.name/”
The document at “http://jonathan.brinley.name/” is describing me
The document at “http://jonathan.brinley.name/” is describing itself

Thus the GIGO principle rears its ugly head. If you give two separate resources the same URI (which is supposed to be a globally unique identifier, remember), then you should expect ambiguity to follow. If you want to identify something uniquely, and that something is not on the web, you should give it a distinct URI from something that is on the web.

So, that answered, we turn to the second half of the problem: is it worthwhile to use non-HTTP URI schemes or non-resolving HTTP URIs?

The recent discussion started with a mention of “info” URIs. These can be used to uniquely identify resources, but have the (potential) drawback of not being dereferenceable. As established above, non-dereferenceability is not inherently bad. If one simply wants to identify something uniquely, the “info” scheme will work, as will several other schemes.

But there is a certain utility in dereferenceability. As edsu asked: “if you were processing an xml file that included a particular namespace wouldn’t it be nice to get a document that describes that namespace without resorting to google?” This is a place where the HTTP scheme can still be useful, even if the resource itself isn’t available on-line. Nothing says a server has to respond to an HTTP GET request with either a 200 “OK” or a 404 “Not Found”. A 303 “See Other” is a perfectly reasonable response to a request for a particular resource, when all that can be provided is a description of that resource. The server can then point to the URI where this description does reside, which will be distinct from the URI for the resource it describes.
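The 303 approach can be sketched in a few lines of Python using only the standard library. This is a toy under stated assumptions, not a recommendation for how any real site should be laid out: the /id/ and /doc/ path prefixes, the handler name SeeOtherHandler, the helper make_server, and the plain-text description are all hypothetical choices of mine.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical paths: /id/... identifies a thing, /doc/... describes it.
THING_PREFIX = "/id/"
DOC_PREFIX = "/doc/"


class SeeOtherHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith(THING_PREFIX):
            # The thing itself can't be sent over the wire, so redirect
            # to the separate URI of a document describing it.
            self.send_response(303)
            self.send_header("Location", DOC_PREFIX + self.path[len(THING_PREFIX):])
            self.end_headers()
        elif self.path.startswith(DOC_PREFIX):
            name = self.path[len(DOC_PREFIX):]
            body = ("Description of " + name + "\n").encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=0):
    """Create the server; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), SeeOtherHandler)


if __name__ == "__main__":
    server = make_server(8000)
    print("Serving on http://127.0.0.1:8000/ (try /id/jonathan)")
    server.serve_forever()
```

With this running, a request for /id/jonathan gets a 303 whose Location header points at /doc/jonathan, and only the latter returns a 200 with a body. The two URIs never collide, so a statement that one describes the other is no longer ambiguous.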