16 November 2012

Digitizing your books

Here's what I've learned so far about digitizing books for personal use.

How should you scan your documents?

The scanning service I've tried is called 1DollarScan. It works well. For a base price of $1 per 100 pages, they will (destructively) turn your book into a PDF. They offer various extra services, each of which costs an additional $1 per 100 pages. Some of these extras are listed below.
  • 600 dpi (instead of 300 dpi)
  • Ship from Amazon.com. That way you can "pretend" any book on Amazon is available as a PDF!
  • OCR. I gave it a try but was not particularly impressed. So now I plan to do the OCR myself using Acrobat. Acrobat has an amazing option called "ClearScan" that replaces recognized characters with their representation in a custom font. I'm still keeping the original scans for reference, in case ClearScan messes up, but so far it has offered me about 18x compression at what appears to be an increase in quality!
  • Use book title as file name. I gave it a try, and it was fine, but now I plan to just name my files using their ISBN, e.g. "ISBN 1234567890". I plan to put the title and author in the PDF metadata. This keeps file names short, though it makes them unfriendly. I'm assuming this unfriendliness won't be a problem since I'll find things by searching content as well as file names, where the content will include the title metadata as well as the OCR data for the images. If you do go with titles in file names, note that many titles can't be represented exactly as file names since they contain characters that are prohibited in file names. E.g. on Windows, a colon is prohibited, but it is a very common character separating title from subtitle! As one would expect, the title field of the PDF metadata has no such restrictions.
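
To make the file-naming trade-off concrete, here is a minimal Python sketch of both approaches. The function names, the ".pdf" extension, and the choice of "-" as a replacement character are my own assumptions for illustration; the set of forbidden characters is the one Windows documents for file names.

```python
import re

# Characters Windows forbids in file names, plus ASCII control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def isbn_file_name(isbn: str) -> str:
    """Build a name like 'ISBN 1234567890.pdf' from a possibly hyphenated ISBN."""
    digits = re.sub(r"[^0-9Xx]", "", isbn).upper()
    if len(digits) not in (10, 13):
        raise ValueError(f"expected 10 or 13 ISBN characters, got {len(digits)}")
    return f"ISBN {digits}.pdf"


def title_file_name(title: str, replacement: str = "-") -> str:
    """Approximate a title as a Windows-safe file name (lossy by design)."""
    safe = _FORBIDDEN.sub(replacement, title)
    # Windows also rejects names that end in a space or a dot.
    safe = safe.rstrip(" .")
    return f"{safe}.pdf"


if __name__ == "__main__":
    print(isbn_file_name("1-234-56789-0"))
    print(title_file_name("Title: Subtitle"))
```

Note that the title version loses information (the colon is gone), which is exactly why the exact title is better kept in the PDF metadata.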

Now, where should you put your scanned PDF?



I've tried storing PDFs on Dropbox, Google Drive, and Scribd. I think Google Drive is the best overall. Here is a comparison of what these services offer.
  • Dropbox
    • No file size limit
    • 2 GB overall limit for free accounts
    • Offline access
    • Native viewers (Acrobat, Preview)
    • Native search (Windows Explorer, Mac Finder)
  • Google Drive
    • 25 MB file size limit for in-browser viewer and in-browser search
    • 5 GB overall limit for free accounts
    • Offline access
    • Native or in-browser viewer
    • Native or in-browser search
  • Scribd
    • 100 MB file size limit; much smaller for ClearScan?
    • No overall space limit for free accounts
    • No offline access
    • In-browser viewer
    • In-browser search
For reference, the books I've scanned have been in the 100-200 MB range. So to accommodate either Google's 25 MB limit or Scribd's 100 MB limit, you would have to split such PDFs up. In Acrobat, you can do this by adding bookmarks at the beginning of each desired section and then telling it to split the document accordingly.
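
If you'd rather work out the split points up front, the arithmetic can be sketched in a few lines of Python. This is only a rough planner under a loud assumption: that every page takes up about the same number of bytes, which is not true of real scans. The function name and the 10% safety margin are my own inventions, not anything Acrobat provides.

```python
import math


def plan_split(total_pages: int, file_size_mb: float, limit_mb: float,
               margin: float = 0.9) -> list[tuple[int, int]]:
    """Return 1-based inclusive page ranges, each estimated to fit under limit_mb.

    Assumes pages are roughly equal in size; `margin` leaves some headroom
    because that assumption is only approximate.
    """
    if total_pages <= 0 or file_size_mb <= 0 or limit_mb <= 0:
        raise ValueError("pages, size, and limit must all be positive")
    parts = math.ceil(file_size_mb / (limit_mb * margin))
    parts = min(parts, total_pages)  # never plan more parts than pages
    base, extra = divmod(total_pages, parts)
    ranges = []
    start = 1
    for i in range(parts):
        length = base + (1 if i < extra else 0)  # spread leftover pages evenly
        ranges.append((start, start + length - 1))
        start += length
    return ranges


if __name__ == "__main__":
    # A hypothetical 300-page, 150 MB scan against the two limits above.
    print(plan_split(300, 150, 100))
    print(plan_split(300, 150, 25))
```

The resulting ranges tell you where to place the bookmarks; you'd still want to check the actual size of each part after splitting.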

Or, you can compress your documents using Acrobat's ClearScan and then they will probably fit even Google's 25 MB limit. So far, I haven't gotten Scribd to accept ClearScan files that were bigger than 100 MB before they were compressed, though. My guess is that when they are converted to Scribd's format, they end up being too big again.
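
As a back-of-the-envelope check on "probably fit", here is a small Python sketch that applies the roughly 18x ratio I saw with ClearScan to the file sizes mentioned above. The ratio is just my own observation on a few books, so treat it as an assumption; the names are made up for illustration.

```python
GOOGLE_VIEWER_LIMIT_MB = 25  # in-browser viewer/search limit from the comparison above
CLEARSCAN_RATIO = 18         # roughly what I observed; your ratio will vary


def estimated_clearscan_size_mb(original_mb: float,
                                ratio: float = CLEARSCAN_RATIO) -> float:
    """Estimate the size after ClearScan, assuming a fixed compression ratio."""
    return original_mb / ratio


if __name__ == "__main__":
    for original in (100, 200):
        estimate = estimated_clearscan_size_mb(original)
        fits = estimate <= GOOGLE_VIEWER_LIMIT_MB
        print(f"{original} MB -> about {estimate:.1f} MB, fits viewer limit: {fits}")
```

Even the 200 MB end of the range comes out well under 25 MB at that ratio, which is why ClearScan looks more attractive than splitting.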

One thing I noticed about Scribd's viewer: it doesn't show you the full resolution of your document, at least not if it is 600 dpi. I suppose you can download the PDF if this is a problem, but that might be mildly annoying.

Unlike Scribd, Google Drive allows files greater than 100 MB. But a PDF bigger than 25 MB won't be searched or shown in the built-in viewer. In other words, it is just an opaque (dumb) bunch of bits.

Okay, so that's what I have to share about my experiences digitizing books. What follows is a postscript on the narrow issue of how to fix PDF search on 64-bit Windows.

Postscript: How to fix PDF search on 64-bit Windows

One of the big advantages of having your books digitized with OCR is the ability to search within a book and among all your books. To my dismay, this was not working for me on Windows.

The reason, I discovered, to my horror, is that on 64-bit Windows, PDFs won't be indexed unless you install a special "IFilter" program from Adobe. Once I did this and rebuilt the index, PDF search started working.
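
If you want to know quickly whether a given machine is in the affected (64-bit) group, a few lines of Python can tell you. This is a minimal sketch: it only inspects the machine string reported by the standard `platform` module, the set of 64-bit names is illustrative rather than exhaustive, and it says nothing about whether the IFilter is actually installed.

```python
import platform

# Machine strings commonly reported for 64-bit systems (illustrative, not exhaustive).
_64BIT_MACHINES = {"amd64", "x86_64", "arm64", "aarch64", "ia64"}


def is_64bit_os(machine=None) -> bool:
    """Guess whether the OS is 64-bit from a machine string such as 'AMD64'.

    With no argument, uses whatever platform.machine() reports on this computer.
    """
    name = platform.machine() if machine is None else machine
    return name.lower() in _64BIT_MACHINES


if __name__ == "__main__":
    print("machine:", platform.machine())
    print("looks 64-bit:", is_64bit_os())
```

On a machine where this prints True and PDF search isn't working, the missing 64-bit IFilter is the first thing I'd check.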

In my opinion, this is really bush-league stuff from an otherwise major-league company like Adobe. Though it's a little weird, I respect their choice to not have a 64-bit version of Acrobat. The 32-bit version works fine and I suppose Acrobat is unlikely to require more than the 2 GB or so of memory that would trigger the need for a 64-bit version. But they should have figured out how to install this 64-bit IFilter thing along with the 32-bit application when it is being installed on a 64-bit OS.

I try not to indulge in Windows-bashing or Mac boosterism unless I have something specific to say. So I guess I'd modify the adage
  "If you don't have anything nice to say, don't say anything at all."
to
  "If you don't have anything specific to say, don't say anything at all."
Here I have something specific to say. The Mac's built-in PDF support is nice. You get a viewer (Preview) and search right out of the box. On Windows, it is mildly annoying that one must install a viewer (usually Acrobat Reader) on each new machine, but it is close to infuriating that indexing (search) doesn't work even when you do that install! Without some intrepid Googling to figure out that this is a 64-bit problem, you won't be able to figure out why indexing works on some machines and not others. (The answer is that some machines are running 32-bit Windows and some machines are running 64-bit Windows!)