Command Line PDF Editing

As I’ve mentioned before, Acrobat’s JavaScript API lags far behind those of other Adobe applications. Its limitations turned a seemingly simple project I was working on into an exercise in futility.

Overview

I have a collection of a little over 5,000 PDF files, the output of an OCR job. Each file contains one page of a newspaper. Four pages put together would make one issue. My goal, then, is to take four PDF files (e.g., 1896-09-24_001.pdf, 1896-09-24_002.pdf, 1896-09-24_003.pdf, and 1896-09-24_004.pdf) and merge them together into one file (1896-09-24.pdf). Then repeat 1,300 times or so to get the rest of the issues.

Sounds like an ideal job for a small script. Unfortunately, Acrobat only gives JavaScript access to the file system for opening and saving files. It has no way to read a directory for a list of files, which is rather fundamental to the task at hand.

PyPdf Almost Works

At Jay Luker’s suggestion, I tried out PyPdf. It seemed to do everything I needed, and indeed it would have, except that it couldn’t read my PDF files. It does its job just fine with other files, but not with these, which OmniPage created. The files appear to be missing an attribute that PyPdf looks for, so I ended up with a KeyError.

pdftk to the Rescue

After much gnashing of teeth on my part, Jason Ronallo suggested I try pdftk, a simple command line tool that can merge and split PDFs, among other capabilities. Merging the issue noted above takes just one line:

pdftk 1896-09-24_001.pdf 1896-09-24_002.pdf 1896-09-24_003.pdf 1896-09-24_004.pdf cat output 1896-09-24.pdf
 

A simple Python script can just call pdftk repeatedly to take care of the whole collection.
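The grouping step such a script needs might look something like the sketch below. It is written in JavaScript rather than Python, only to match the other code on this blog. The function name `buildMergeCommands` and the filename pattern are assumptions based on the example names above, and actually running each command is left out.

```javascript
// Group single-page files such as "1896-09-24_001.pdf" by issue date and
// build one pdftk argument list per issue.
// Assumption: every page file is named YYYY-MM-DD_NNN.pdf.
function buildMergeCommands(fileNames) {
	var issues = {};
	var pattern = /^(\d{4}-\d{2}-\d{2})_(\d{3})\.pdf$/;
	// Zero-padded page numbers mean a plain sort puts pages in order
	fileNames.slice().sort().forEach(function (name) {
		var match = pattern.exec(name);
		if (!match) {
			return; // skip anything that does not follow the naming scheme
		}
		var issue = match[1];
		if (!issues[issue]) {
			issues[issue] = [];
		}
		issues[issue].push(name);
	});
	return Object.keys(issues).sort().map(function (issue) {
		// Same shape as the one-line command: inputs, "cat", "output", target
		return ["pdftk"].concat(issues[issue], ["cat", "output", issue + ".pdf"]);
	});
}
```

Each returned array could then be handed to whatever the scripting language offers for launching an external program.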

Many thanks, pdftk and #code4lib.

Fun with Acrobat

In my last post, I noted the need to convert some PDFs from a format suitable for a printer to a format suitable for online reading. The PDFs of the Muncie Times that I receive are laid out as spreads for each printed sheet of paper. So, for example, the first spread of the PDF includes pages 48 and 1, the next spread includes pages 2 and 47, etc. Many of the pages also have various printer’s marks along the edge.

First, I established a general algorithm for tidying this up.

  1. Make a copy of the document in reverse order and append it to the end of the document.
  2. Crop off the unneeded half of each spread (left for the odd-numbered spreads, right for the even-numbered spreads).
  3. Delete the printer’s marks from the margins.
  4. Add top and bottom margins.

If you’ve tried to do much automation with Adobe CS applications, you’ve probably encountered the well-documented JavaScript APIs that make the job much easier. Acrobat is special. Its API is very different, much more limited, and boasts horrible, often inaccurate documentation. Even getting Acrobat to recognize and run a script can be such a chore that I’ve taken to copying code into its JavaScript console and running it from there.

That rant aside, it wasn’t too difficult to accomplish the first couple of steps. Step 1 (assuming you’ve already opened the document):

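// Step 1: append a reversed copy of the document to itself.
// Every page is inserted directly after the last original page, so each
// insertion pushes the earlier ones down and the copy ends up reversed.
var nPages = this.numPages;
for (var i = 0; i < nPages; i++) {
	this.insertPages({
		nPage: nPages - 1,  // insert after the last original page
		cPath: this.path,   // source file: this same document
		nStart: i           // page to copy (0-based)
	});
}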

Step 2:

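// Step 2: crop each spread down to the half that is needed.
for (var i = 0; i < this.numPages; i++) {
	if (i % 2 === 0) {
		// Odd-numbered spread (even 0-based index): keep the right half
		this.setPageBoxes({
			cBox: "Crop",
			nStart: i,
			rBox: [11.25*72, 0*72, 22.5*72, 13*72]
		});
	} else {
		// Even-numbered spread: keep the left half
		this.setPageBoxes({
			cBox: "Crop",
			nStart: i,
			rBox: [0*72, 0*72, 11.25*72, 13*72]
		});
	}
}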

Note that all measurements must be in points. Since 1 inch = 72 points, I just multiply all of my values by 72. I probably could have made this more universal by letting the script calculate the width of the page and then dividing that in half.
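A more universal version might look roughly like this. It is only a sketch: `halfSpreadBox` is a made-up helper, it takes the spread size in inches as arguments, and it returns the box in the same layout as the calls above. In Acrobat the width would presumably come from the page itself (for example via `getPageBox`), which isn't shown here.

```javascript
var POINTS_PER_INCH = 72;

// Return the crop box for one half of a spread, converted to points.
// Box layout follows the setPageBoxes calls above: [x0, y0, x1, y1].
function halfSpreadBox(widthIn, heightIn, keepRight) {
	var half = widthIn / 2;
	var x0 = keepRight ? half : 0; // the right half starts at the midpoint
	return [
		x0 * POINTS_PER_INCH,
		0,
		(x0 + half) * POINTS_PER_INCH,
		heightIn * POINTS_PER_INCH
	];
}
```

For my 22.5 in. x 13 in. spreads, `halfSpreadBox(22.5, 13, true)` gives the same numbers as the hard-coded right-half box, and passing `false` gives the left-half box.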

At this point I discovered an oddity of Acrobat. In other Adobe programs (and any image-editing program I’ve ever used), when you crop something, you define an area and remove everything outside of that area. Acrobat never removes any part of a page; it merely hides it. So while you have a document that looks like it has 48 pages, each 11.25 in. x 13 in., you really have a document that has 48 pages, each 22.5 in. x 13 in., which is to say a document twice the size it needs to be.

In my search for a fix to this, I eventually came across this handy tip from the Acrobat 7 PDF Bible (p. 388):

If you want to eliminate the excess data retained from the Crop tool, you can open the PDF in either Adobe Photoshop or Adobe Illustrator. Both programs honor the cropped regions of PDF files cropped in Acrobat. When you open a cropped page in either program, resave it as a PDF.

Very helpful information, that, if the cure weren’t worse than the disease.

  1. Neither program can open more than one page of a document at a time. But I could write another script to do this part if that were the only problem.
  2. Photoshop rasterizes all of the text. Needless to say, that’s unacceptable.
  3. Illustrator can’t use embedded fonts if you don’t have them on your system and will replace them with whatever fonts are available. Since the PDFs come from a Mac and I have a PC, this won’t work.

After mulling this over a bit more, I had an epiphany: print it! “Gasp,” you say, “wasn’t the whole point of this to avoid having to go through a print version?” Yes, but printing doesn’t have to go to a physical medium. In this case, I used the Adobe PDF printer that comes with Acrobat to print my PDF to a PDF. Incredibly, this worked. By setting a paper size to 11.25 in. x 13 in., I could print the document to a new, appropriately sized document while discarding the excess data (and doing some optimization for online viewing while I was at it). Step 2 complete.

After discovering how to accomplish step 2, I realized that steps 3 and 4 could be accomplished in a similar manner. Crop the margins off and print to a new PDF with the margins I want, clear of any printer’s marks. As a matter of fact, these steps could be rolled into step 2. Simply take off an extra 0.45 in. from each side of the page, then print to an 11.25 in. x 13.75 in. page. So the new combined code for steps 2 and 3:

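// Steps 2 and 3: crop to one half of each spread, trimming an extra
// 0.45 in. from the left and right edges to remove the printer's marks.
for (var i = 0; i < this.numPages; i++) {
	if (i % 2 === 0) {
		// Odd-numbered spread (even 0-based index): keep the right half
		this.setPageBoxes({
			cBox: "Crop",
			nStart: i,
			rBox: [11.70*72, 0*72, 22.05*72, 13*72]
		});
	} else {
		// Even-numbered spread: keep the left half
		this.setPageBoxes({
			cBox: "Crop",
			nStart: i,
			rBox: [0.45*72, 0*72, 10.80*72, 13*72]
		});
	}
}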

After that’s run, you create your custom paper size and print to PDF, centering the content on the slightly larger page.
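For anyone adapting this to a different page size, the margin arithmetic can be checked with a few lines of JavaScript. This is only an illustration, and `centeredMargins` is a made-up helper, not part of Acrobat’s API.

```javascript
// Margins left around cropped content when it is centered on a larger sheet.
// All values are in inches.
function centeredMargins(contentWidth, contentHeight, sheetWidth, sheetHeight) {
	return {
		side: (sheetWidth - contentWidth) / 2,
		topBottom: (sheetHeight - contentHeight) / 2
	};
}

// Crop box from the combined script: 11.70 in. to 22.05 in. wide, 13 in. tall,
// printed to an 11.25 in. x 13.75 in. sheet
var margins = centeredMargins(22.05 - 11.70, 13, 11.25, 13.75);
```

With my numbers, the 10.35 in. x 13 in. content ends up with about 0.45 in. of clean margin on each side and 0.375 in. at the top and bottom.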