Fun with Acrobat

In my last post, I noted the need to convert some PDFs from a format suitable for a printer to a format suitable for online reading. The PDFs of the Muncie Times that I receive are laid out as spreads for each printed sheet of paper. So, for example, the first spread of the PDF includes pages 48 and 1, the next spread includes pages 2 and 47, etc. Many of the pages also have various printer’s marks along the edge.

First, I established a general algorithm for tidying this up.

  1. Make a copy of the document in reverse order and append it to the end of the document.
  2. Crop off the unneeded half of each spread (left for the odd-numbered spreads, right for the even-numbered spreads).
  3. Delete the printer’s marks from the margins.
  4. Add top and bottom margins.

If you’ve tried to do much automation with Adobe CS applications, you’ve probably encountered the well-documented JavaScript APIs that make the job much easier. Acrobat is special: its API is very different, much more limited, and saddled with horrible, often inaccurate documentation. Even getting Acrobat to recognize and run a script can be such a chore that I’ve taken to copying code into its JavaScript console and running it from there.

That rant aside, it wasn’t too difficult to accomplish the first couple of steps. Step 1 (assuming you’ve already opened the document):

var nPages = this.numPages;
// Append a reversed copy of the document to itself. Each insertion lands
// just after the original last page, pushing earlier insertions back, so
// the appended copies end up in reverse order.
for (var i = 0; i < nPages; i++) {
	this.insertPages({
		nPage: nPages-1,   // insert after the original last page
		cPath: this.path,  // pull the page from this same file
		nStart: i          // the page to copy
	});
}

Step 2:

// Even-indexed spreads keep their right half; odd-indexed spreads keep
// their left half. Measurements are in points (1 in. = 72 pt).
for (var i = 0; i < this.numPages; i++) {
	if (i % 2 == 0) {
		this.setPageBoxes({
			cBox: "Crop",
			nStart: i,
			rBox: [11.25*72, 0*72, 22.5*72, 13*72]  // right half of the 22.5 in. x 13 in. spread
		});
	} else {
		this.setPageBoxes({
			cBox: "Crop",
			nStart: i,
			rBox: [0*72, 0*72, 11.25*72, 13*72]  // left half
		});
	}
}

Note that all measurements must be in points. Since 1 inch = 72 points, I just multiply all of my values by 72. I probably could have made this more universal by letting the script calculate the width of the page and then divide that in half.
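For what it’s worth, that more general version might look something like this (a sketch I haven’t tested against these files; it assumes each spread’s media box spans the full width of the spread):

// Sketch: compute each spread's midpoint instead of hard-coding 11.25 in.
for (var i = 0; i < this.numPages; i++) {
	var box = this.getPageBox("Media", i);  // [upper-left x, upper-left y, lower-right x, lower-right y]
	var mid = (box[0] + box[2]) / 2;        // horizontal midpoint of the spread
	this.setPageBoxes({
		cBox: "Crop",
		nStart: i,
		rBox: (i % 2 == 0)
			? [mid, box[3], box[2], box[1]]    // even index: keep the right half
			: [box[0], box[3], mid, box[1]]    // odd index: keep the left half
	});
}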

At this point I discovered an oddity of Acrobat. In other Adobe programs (and any image-editing program I’ve ever used), when you crop something, you define an area and remove everything outside of that area. Acrobat never removes any part of a page; it merely hides it. So while you have a document that looks like it has 48 pages, each 11.25 in. x 13 in., you really have a document with 48 pages, each 22.5 in. x 13 in., which is to say a document twice the size it needs to be.

In my search for a fix to this, I eventually came across this handy tip from the Acrobat 7 PDF Bible (p. 388):

If you want to eliminate the excess data retained from the Crop tool, you can open the PDF in either Adobe Photoshop or Adobe Illustrator. Both programs honor the cropped regions of PDF files cropped in Acrobat. When you open a cropped page in either program, resave it as a PDF.

That would be very helpful information, if the cure weren’t worse than the disease.

  1. Neither program can open more than one page of a document at a time. But I could write another script to do this part if that were the only problem.
  2. Photoshop rasterizes all of the text. Needless to say, that’s unacceptable.
  3. Illustrator can’t use embedded fonts that aren’t installed on your system and will substitute whatever fonts are available. Since the publisher works on a Mac and I’m on a PC, we don’t have the same fonts, so this won’t work.

After mulling this over a bit more, I had an epiphany: print it! “Gasp,” you say, “wasn’t the whole point of this to avoid having to go through a print version?” Yes, but printing doesn’t have to go to a physical medium. In this case, I used the Adobe PDF printer that comes with Acrobat to print my PDF to a PDF. Incredibly, this worked. By setting a paper size to 11.25 in. x 13 in., I could print the document to a new, appropriately-sized document while discarding the excess data (and doing some optimization for online viewing while I was at it). Step 2 complete.
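If you’d rather skip the Print dialog each time, the same JavaScript API can drive this step too. Something like the following should work (a sketch; it assumes the custom 11.25 in. x 13 in. paper size has already been defined in the Adobe PDF printer’s preferences):

// Sketch: send the cropped document to the Adobe PDF printer, silently.
var pp = this.getPrintParams();
pp.printerName = "Adobe PDF";                              // the print-to-PDF driver
pp.pageHandling = pp.constants.handling.fit;               // fit/center the content on the paper
pp.interactive = pp.constants.interactionLevel.automatic;  // suppress the print dialog
this.print(pp);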

After discovering how to accomplish step 2, I realized that steps 3 and 4 could be accomplished in a similar manner: crop the margins off and print to a new PDF with the margins I want, clear of any printer’s marks. As a matter of fact, these steps could be rolled into step 2. Simply take off an extra 0.45 in. from each side of the page, then print to an 11.25 in. x 13.75 in. page. So the new combined code for steps 2 and 3:

// Same crop as before, but with an extra 0.45 in. shaved off each side
// to drop the printer's marks.
for (var i = 0; i < this.numPages; i++) {
	if (i % 2 == 0) {
		this.setPageBoxes({
			cBox: "Crop",
			nStart: i,
			rBox: [11.70*72, 0*72, 22.05*72, 13*72]  // right half, minus 0.45 in. per side
		});
	} else {
		this.setPageBoxes({
			cBox: "Crop",
			nStart: i,
			rBox: [0.45*72, 0*72, 10.80*72, 13*72]  // left half, minus 0.45 in. per side
		});
	}
}

After that’s run, you create your custom paper size and print to PDF, centering the content on the slightly larger page.

Digital to Print to Digital, or, Running in Circles

Rule: Don’t add unnecessary, value-subtracting steps. If a process already has these steps in it, take them out.

Application: I’ve come to be responsible for an ongoing newspaper digitization project. Not a large project, by any means, but important for the library’s community relations. We (“we” being the Ball State University Libraries) created a digital archive of the Muncie Times, a local newspaper that is still published regularly.

Dealing with back issues was straightforward: scan and OCR. But, as I mentioned, the newspaper is still published regularly, so we get another issue every other week. Here’s the workflow I inherited:

  1. The publisher creates the issue using QuarkXPress.
  2. The publisher exports a PDF and sends it to the printer.
  3. The printer prints the issue.
  4. The publisher sends a printed copy of the issue to the library.
  5. The library scans and OCRs the issue.
  6. The library puts the issue on the Internet.

If you’re like me, you look at steps 3-5 and groan at the inanity of it. These steps made sense for the back issues that no one had retained a digital version of, but there is absolutely no reason, in this 21st century, to use printed newspapers in creating a digital archive of digital objects.

Here’s the new workflow:

  1. The publisher creates the issue using QuarkXPress.
  2. The publisher exports a PDF and sends it to the printer and the library.
  3. The library puts the issue on the Internet.

It’s a miracle! Faster, easier, cheaper, and (most importantly) higher-quality, just by cutting out half the steps.

Caveat: The new step 3 isn’t quite so easy as it sounds. The first problem is getting the publisher to actually do step 2. The second problem (which I’ll cover in a bit) is converting the PDFs from a format suitable for the printer to a format suitable for online reading.

Did you mean: fluoride?

My dentist told me two noteworthy things yesterday: I need to floss more, and she misses the card catalog. I’ll leave aside my dental hygiene, it being a bit out of the scope of this blog, to focus on the latter.

She complained that the online catalog never works for her for one simple reason: she’s a horrible speller. With the card catalog, she could get to the general area and then thumb through the cards until she found what she was looking for. With an online catalog, a mistyped word gets you, “No results matched your query”, or some such. Then it’s off to the dictionary to figure out how to spell what you’re looking for. Or the user just assumes your library doesn’t have any relevant resources and goes to find the first match on Google.

There are some rather simple solutions for this that I have seen implemented. The catalog can suggest similarly spelled words when the user searches for an unknown term, much in the same way that Google or Amazon asks, “Did you mean: properly spelled word”. Or the user can land in a list of indexed terms that are nearby, alphabetically. (I’ll leave it to others to determine the optimal user interface for dealing with multiple misspelled words.)
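For the curious, the first approach is simple enough to sketch. Here’s a toy version using plain edit distance (the term list is invented; a real catalog would consult its own index and probably weight suggestions by frequency):

// Toy sketch of "did you mean": suggest the indexed term closest to the
// mistyped query, as measured by Levenshtein edit distance.
function editDistance(a, b) {
	var d = [];
	for (var i = 0; i <= a.length; i++) { d[i] = [i]; }
	for (var j = 0; j <= b.length; j++) { d[0][j] = j; }
	for (i = 1; i <= a.length; i++) {
		for (j = 1; j <= b.length; j++) {
			var cost = (a.charAt(i-1) == b.charAt(j-1)) ? 0 : 1;
			d[i][j] = Math.min(d[i-1][j] + 1,        // deletion
			                   d[i][j-1] + 1,        // insertion
			                   d[i-1][j-1] + cost);  // substitution
		}
	}
	return d[a.length][b.length];
}

function didYouMean(query, terms) {
	var best = null, bestDist = 3;  // only suggest within two edits
	for (var k = 0; k < terms.length; k++) {
		var dist = editDistance(query, terms[k]);
		if (dist < bestDist) { best = terms[k]; bestDist = dist; }
	}
	return best;
}

// didYouMean("flouride", ["fluoride", "fluorine", "florid"]) returns "fluoride"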

The point is that our catalogs are failing our users, in this way among others. Someone would prefer, with good reason, to manually flip through printed cards rather than take advantage of the far greater search capabilities of the computer, because we haven’t replicated the functionality of a stack of paper. Vendors, why don’t we have these tools in place as a standard part of every catalog, of every journal database, of every digital library? It would be nice to finally offer quarter-century-old technology to our users.

JCDL 2007

The ACM and IEEE put on their Joint Conference on Digital Libraries this week in Vancouver, B.C. While I was not able to stay for the full conference, which looked to have a great program, I was fortunate to attend a pre-conference tutorial on Tuesday, “Thesauri and ontologies in digital libraries”, starring Dagobert Soergel from the University of Maryland. This will not be a play-by-play; most of that can be found in the “workbook” (PDF from the same workshop given at ECDL 2006) Dr. Soergel gave us. Instead, I will highlight a few points from the day.

Dr. Soergel’s premise, given at the start of the day, was that “the system should support the user in creating meaning”. That is, after interacting with a thesaurus, the user should come away with a greater understanding of relevant concepts and their relationships to each other. Do note, however, that the user need not interact with the thesaurus directly. A UI design challenge is to develop an interface that integrates the structure and power of the thesaurus without requiring the user to navigate the vocabulary to find the preferred terms. Dr. Soergel did not have a solution for how this should be done, but he did offer a few suggestions in that direction.

Different user groups (and, usually, different individuals) will have different preferred terms for the same concepts, an idea that is becoming increasingly acceptable among librarians. FRAD, for example, acknowledges that a concept or entity can have multiple preferred forms, each under a different authority, and this concept is key to the Virtual International Authority File. Dr. Soergel used the example of medical conditions: one term for a condition may be more useful to doctors, another to the layman. The system should be able to support both of these communities and more through the appropriate use of structured relationships among terms. A large, but not necessarily comprehensive, list of these structured relationships can be found in the above PDF (p. 191), along with a draft of how these relationships might be modeled in a relational database (p. 196).
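To make that concrete, a thesaurus entry along those lines might look something like this (a toy structure with invented labels and identifiers, not taken from the workbook):

// Toy sketch: one concept, multiple community-specific preferred labels,
// plus typed relationships to other concepts.
var concept = {
	id: "C0001",  // hypothetical identifier
	labels: [
		{ term: "hypertension",        preferredFor: "doctor" },
		{ term: "high blood pressure", preferredFor: "layman" }
	],
	broader: ["vascular diseases"],
	related: ["blood pressure"]
};

// Pick the display term for a given audience, falling back to the first label.
function labelFor(concept, audience) {
	for (var i = 0; i < concept.labels.length; i++) {
		if (concept.labels[i].preferredFor == audience) {
			return concept.labels[i].term;
		}
	}
	return concept.labels[0].term;
}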

We spent some time talking about multilingual thesauri, a topic far more complicated than I had initially realized. Translating a thesaurus into a new language does not make it multilingual unless there is a one-to-one mapping of terms used to express a concept in one language to terms used to express a concept in the other. To make this work, one must often invent terms for one of the languages to express a concept in the other. For example, German has no word for a watch (a timepiece you carry with you), even though it does have words for specific kinds of watches (e.g., Taschenuhr = pocket watch, Armbanduhr = wrist watch), so a term would have to be invented in German to match the English concept of a watch. Sometimes, different languages approach things from such different perspectives that even inventing terms will not suffice.

Despite the title, there was not much discussion of ontologies or digital libraries; I suppose “Thesauri” by itself lacks marketability. But on the topic of thesauri, the tutorial was informative and well-presented. This was my first visit to Vancouver, a lovely city that I hope to return to someday (hopefully without having to sprint through the terminals at O’Hare next time). Indiana is certainly lacking in oceans and mountains.

OVGTSL 2007 – Part 4 – RDA

The final part of the conference focused on RDA. I think Dr. Tillett is the third member of the JSC I’ve heard speak on RDA. Every time I hear one of them, I’m very encouraged that things are moving in the right direction, albeit haltingly.

RDA is much more principle-based than previous cataloging rules, which should serve us well. Unfortunately, one of the principles seems to be “Don’t scare the library administrators”. It is this that keeps RDA from being the revolutionary change it probably needs to be. By insisting on near-complete backwards compatibility, the JSC seems to be saying, “Keep cataloging exactly the same way you always have, but here are your new reasons for doing it that way”.

But, as I said, progress is being made. Over time it should become clear which cataloging practices are not based on the principles enshrined in RDA (and, by extension, in FRBR). Perhaps future revisions of RDA will slowly weed these out. The decision to unwed RDA from ISBD and MARC is definitely good news. And there will be fewer instances of disparate pieces of information being combined into one unparsable data element (Hooray, they’re not calling them metadata elements!). This and more should make RDA a very useful content standard for non-MARC cataloging.