Synonym Expansion in Google

This may be old news to many: Google can do synonym expansion.

For example, a search for cataloging digital resources gives you about 20 million web pages containing those three words. Add a tilde in front of a word, and Google will also search for synonyms of that word. So cataloging ~digital resources returns over 60 million pages that include those words or synonyms of the word “digital”, such as “computer” and “electronic”.
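
To make the idea concrete, here is a minimal sketch of tilde-style expansion. It is only an illustration: the synonym table and the function names are my own assumptions, and it says nothing about how Google actually implements the feature.

```python
# Hypothetical synonym table; Google's real synonym data is not public.
SYNONYMS = {
    "digital": ["computer", "electronic"],
}


def expand_query(query: str) -> list[list[str]]:
    """Return one list of acceptable alternatives per query word.

    A word prefixed with "~" is expanded with its synonyms;
    every other word stands alone.
    """
    expanded = []
    for token in query.split():
        if token.startswith("~"):
            word = token[1:]
            expanded.append([word] + SYNONYMS.get(word.lower(), []))
        else:
            expanded.append([token])
    return expanded


def matches(document: str, expanded: list[list[str]]) -> bool:
    """True if the document contains at least one alternative for every query word."""
    words = set(document.lower().split())
    return all(any(alt.lower() in words for alt in group) for group in expanded)
```

With this sketch, a page titled "guide to cataloging electronic resources" matches cataloging ~digital resources but not the plain query, which is why the expanded search returns more pages.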

JCDL 2007

The ACM and IEEE put on their Joint Conference on Digital Libraries this week in Vancouver, B.C. While I was not able to stay for the full conference, which looked to have a great program, I was fortunate to attend a pre-conference tutorial on Tuesday, “Thesauri and ontologies in digital libraries”, starring Dagobert Soergel from the University of Maryland. This will not be a play-by-play; most of that can be found in the “workbook” (PDF from the same workshop given at ECDL 2006) Dr. Soergel gave us. Instead, I will highlight a few points from the day.

Dr. Soergel’s premise, given at the start of the day, was that “the system should support the user in creating meaning”. Id est, after interacting with a thesaurus, the user should come away with a greater understanding of relevant concepts and their relationships to each other. Do note, however, that the user need not interact with the thesaurus directly. A UI design challenge is to develop an interface that integrates the structure and power of the thesaurus without requiring the user to navigate the vocabulary to find the preferred terms. Dr. Soergel did not have a solution to how this should be done, but did offer a few suggestions in that direction.

Different user groups (and, usually, different individuals) will have different preferred terms for the same concepts, an idea that is becoming increasingly acceptable among librarians. FRAD, for example, acknowledges that a concept or entity can have multiple preferred forms, each under a different authority, and this concept is key to the Virtual International Authority File. Dr. Soergel used the example of medical conditions; one term for a condition may be more useful to doctors, another to the layman. The system should be able to support both of these communities and more through the appropriate use of structured relationships among terms. A large, but not necessarily comprehensive, list of these structured relationships can be found in the above PDF (p. 191), along with a draft of how these relationships might be modeled in a relational database (p. 196).
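
As a rough illustration of one concept carrying a different preferred term for each community, here is a small sketch. The class names, the community labels, and the heart attack example are my own assumptions; this is not the model from the workbook.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    concept_id: str
    # Hypothetical structure: community name -> that community's preferred term.
    preferred: dict[str, str] = field(default_factory=dict)
    # Other entry terms that should still lead to this concept.
    variants: set[str] = field(default_factory=set)

    def all_terms(self) -> set[str]:
        return set(self.preferred.values()) | self.variants

    def label_for(self, community: str) -> str:
        """Preferred term for one community (KeyError if none is recorded)."""
        return self.preferred[community]


class Thesaurus:
    def __init__(self) -> None:
        self._concepts: dict[str, Concept] = {}
        self._index: dict[str, str] = {}  # lowercased term -> concept id

    def add(self, concept: Concept) -> None:
        self._concepts[concept.concept_id] = concept
        for term in concept.all_terms():
            self._index[term.lower()] = concept.concept_id

    def lookup(self, term: str) -> Optional[Concept]:
        """Find the concept behind any of its terms, preferred or not."""
        concept_id = self._index.get(term.lower())
        return self._concepts.get(concept_id) if concept_id else None


# Illustrative data only.
thesaurus = Thesaurus()
thesaurus.add(Concept(
    concept_id="c1",
    preferred={"clinician": "myocardial infarction", "layperson": "heart attack"},
    variants={"MI"},
))
```

A layperson who types "heart attack" and a doctor who types "MI" both land on the same concept, and each can be shown the label their own community prefers without ever browsing the vocabulary.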

We spent some time talking about multilingual thesauri, a topic far more complicated than I had initially realized. Translating a thesaurus into a new language does not make it multilingual unless there is a one-to-one mapping of terms used to express a concept in one language to terms used to express a concept in the other. To make this work, one must often invent terms for one of the languages to express a concept in the other. For example, German has no word for a watch (a timepiece you carry with you), even though it does have words for specific kinds of watches (e.g., Taschenuhr = pocket watch, Armbanduhr = wrist watch), so a term would have to be invented in German to match the English concept of a watch. Sometimes, different languages approach things from such different perspectives that even inventing terms will not suffice.
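
One way to picture the one-to-one requirement is to list the concepts that lack a term in one of the languages. The sketch below is illustrative only; the concept identifiers, language codes, and function name are assumptions of mine.

```python
def missing_labels(
    concepts: dict[str, dict[str, str]],
    languages: list[str],
) -> dict[str, list[str]]:
    """Return, for each concept, the languages in which it has no term."""
    gaps = {}
    for concept_id, labels in concepts.items():
        absent = [lang for lang in languages if lang not in labels]
        if absent:
            gaps[concept_id] = absent
    return gaps


# Illustrative data based on the watch example above.
concepts = {
    "watch": {"en": "watch"},
    "pocket_watch": {"en": "pocket watch", "de": "Taschenuhr"},
    "wrist_watch": {"en": "wrist watch", "de": "Armbanduhr"},
}

gaps = missing_labels(concepts, ["en", "de"])
```

Here gaps reports that "watch" has no German term, which is exactly the spot where a term would have to be invented before the thesaurus could be called multilingual.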

Despite the title, there was not much discussion of ontologies or digital libraries; I suppose “Thesauri” by itself lacks marketability. But on the topic of thesauri, the tutorial was informative and well-presented. This was my first visit to Vancouver, a lovely city that I hope to return to someday (hopefully without having to sprint through the terminals at O’Hare next time). Indiana is certainly lacking in oceans and mountains.

Off-Line Web Applications With Google Gears

One of the most frequent criticisms of the modern crop of web- and AJAX-based applications is the need for an Internet connection for them to work. After all, what good are Google Docs or webmail to you if you are on an airplane or facing a temporary connection interruption? Google has brought us one step closer to fixing that this week with its new Google Gears browser extension.

As others have pointed out, Google is not the first to develop something like this. But putting their weight behind it should help increase the development and adoption of off-line web applications.

I can imagine several potential uses for Gears within the library. Someone could conceivably save a set of records within an OPAC web interface, and still have access to those records while taking their laptop through the stacks of the library (presuming the library doesn’t have adequate Wi-Fi, of course). Or a student could save several items from a digital collection and still have access to those pages when presenting them to a class later. These things can already be done, of course, but Gears should make them easier and more user-friendly.

Like many Google products, Gears is still in Beta (much like Google Reader, which has implemented Gears), which means there are still some bugs to work out. There is some suspicion that it was rushed out the door to meet the deadline of Google Developer Day. But hopefully Gears will soon be a stable, robust, functional API for delivering off-line web applications.

OVGTSL 2007 – Part 4 – RDA

The final part of the conference focused on RDA. I think Dr. Tillett is the third member of the JSC I’ve heard speak on RDA. Every time I hear one of them, I’m very encouraged that things are moving in the right direction, albeit haltingly.

RDA is much more principle-based than previous cataloging rules, which should serve us well. Unfortunately, one of the principles seems to be “Don’t scare the library administrators”. It is this that keeps RDA from being the revolutionary change it probably needs to be. By insisting on near-complete backwards compatibility, the JSC seems to be trying to say “Keep cataloging exactly the same way you always have, but here are your new reasons for doing it that way”.

But, as I said, progress is being made. Over time it should become clear which cataloging practices are not based on the principles enshrined in RDA (and, by extension, in FRBR). Perhaps future revisions of RDA will slowly weed these out. The decision to unwed RDA from ISBD and MARC is definitely good news. And there will be fewer instances of disparate pieces of information being combined into one unparsable data element (Hooray, they’re not calling them metadata elements!). This and more should make RDA a very useful content standard for non-MARC cataloging.

OVGTSL 2007 – Part 3 – Virtual International Authority File

Discussion of FRAD leads right into a project that includes the Library of Congress, OCLC, and several other institutions around the world to develop a Virtual International Authority File (VIAF). The first, proof-of-concept stage of the project involves experiments in combining the personal name authority files of the Library of Congress and Die Deutsche Bibliothek. The ultimate goal is to enable authority control on a global scale by matching and linking authority records from all the national libraries.

Dr. Tillett mentioned several other projects with this same goal that have failed or not gone far enough. They all ran into obstacles matching the records reliably and consistently. What VIAF has going for it that these other projects did not is, basically, better matching algorithms and access to OCLC’s bibliographic database. This gives them an error rate of less than 1%.
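
I do not know the details of VIAF's matching algorithms, but the general idea of using bibliographic data to confirm a name match can be sketched roughly as follows. Everything here (the record layout, the function names, the rule that one shared title is enough) is a simplifying assumption of mine, not the project's actual method.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip diacritics and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in stripped)
    return " ".join(cleaned.lower().split())


def likely_same_person(rec_a: dict, rec_b: dict) -> bool:
    """Treat two authority records as a match only if the normalized names
    agree AND the records share at least one associated title."""
    if normalize(rec_a["name"]) != normalize(rec_b["name"]):
        return False
    titles_a = {normalize(t) for t in rec_a["titles"]}
    titles_b = {normalize(t) for t in rec_b["titles"]}
    return bool(titles_a & titles_b)


# Hypothetical records from two national authority files.
record_one = {"name": "Müller, Hans", "titles": ["Beispielwerk"]}
record_two = {"name": "Muller, Hans", "titles": ["Beispielwerk.", "Another Title"]}
record_three = {"name": "Muller, Hans", "titles": ["Unrelated Work"]}
```

In this sketch, record_one and record_two match because they share a title, while record_three has the same name but no work in common and is left unlinked. Requiring evidence from the bibliographic records is what keeps two different people with the same name from being merged.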

As a side note, Dr. Tillett mentioned that the Library of Congress will have Unicode capabilities in their authority file by December or January.