I have a collection of a little over 5,000 PDF files, the output of an OCR job. Each file contains one page of a newspaper, and four pages together make one issue. My goal, then, is to take the four PDF files for an issue (e.g., 1896-09-24_001.pdf through 1896-09-24_004.pdf) and merge them into one file (1896-09-24.pdf), then repeat 1,300 times or so to get the rest of the issues.
PyPdf Almost Works
At Jay Luker’s suggestion, I tried out PyPdf. It seemed to do everything I needed, and indeed it would have, except that it couldn’t read my PDF files. It does its job just fine with other files, but not with the ones OmniPage created. Those files appear to be missing an attribute that PyPdf looks for, so I ended up with a KeyError.
pdftk to the Rescue
After much gnashing of teeth on my part, Jason Ronallo suggested I try pdftk, a simple command-line tool that can merge and split PDFs, among other things. Merging the issue noted above takes just one line:
pdftk 1896-09-24_001.pdf 1896-09-24_002.pdf 1896-09-24_003.pdf 1896-09-24_004.pdf cat output 1896-09-24.pdf
A simple Python script can just call pdftk repeatedly to take care of the whole collection.
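A minimal sketch of such a script might look like the following. It assumes every page file follows the 1896-09-24_001.pdf naming pattern and that pdftk is on the PATH. The helper names (group_pages, build_command, merge_all) and the directory arguments are made up for illustration.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path

# Assumption: page files are named like 1896-09-24_001.pdf
PAGE_NAME = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{3})\.pdf$")


def group_pages(filenames):
    """Map each issue date to its page filenames, sorted by page number."""
    issues = defaultdict(list)
    for name in filenames:
        match = PAGE_NAME.match(name)
        if match:  # skip anything that does not look like a page file
            issues[match.group(1)].append(name)
    # Zero-padded page numbers mean a plain string sort gives page order.
    return {date: sorted(pages) for date, pages in issues.items()}


def build_command(date, pages, src_dir, out_dir):
    """Build the same pdftk invocation as the one-liner above."""
    inputs = [str(Path(src_dir) / page) for page in pages]
    output = str(Path(out_dir) / f"{date}.pdf")
    return ["pdftk", *inputs, "cat", "output", output]


def merge_all(src_dir, out_dir):
    """Run pdftk once per issue; raises if pdftk exits with an error."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    names = [p.name for p in Path(src_dir).glob("*.pdf")]
    for date, pages in sorted(group_pages(names).items()):
        subprocess.run(build_command(date, pages, src_dir, out_dir), check=True)
```

Calling merge_all("pages", "issues") would then write one merged PDF per issue date into issues/. Both directory names are placeholders.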
Many thanks, pdftk and #code4lib.