Wednesday, April 2, 2014

Getting A PDF To Work Well On A Kobo

The mournful prelude

Over the past few years I've amassed a collection of PDFs on various topics. They were stored with the best of intentions. "Surely one day I will read this", I told myself. Sadly, I never got around to actually reading them.

I have an iPad, but I dislike reading PDFs on it. Something about it doesn't mesh. I do read PDFs I've printed, but printing scales badly in both cost and logistics.

I tried an eBook reader, a first generation Kobo. It worked pretty well for the one book I bought from Kobo. It failed miserably to work with PDFs. Zooming in and out on a slow e-ink screen was too painful. Even the new ones look like they're pretty poor with PDFs (https://www.youtube.com/watch?v=wkWVaPw3Fgs).

The problem is not the reader. The problem is the format. PDFs are not intended to flow naturally like HTML. They are a rigid document intended to reproduce an exact copy onto the paper for which they were designed.

My problem is compounded by the fact that most of the PDFs that I want to read use a two-column layout. This makes them nearly impossible to view on a little 6" screen.

As a result, I left the PDFs to rot on Dropbox's servers. That is, until the past few nights.

Revising my tool set

Recently my life changed. This gave me more time and need to read my PDFs than ever before. I now live in Florida. I'm also self-publishing a book on NoSQL (New Data For Managers). I should be able to research the content while sitting on a beach just 45 minutes from my house. A beach means that the iPad is out.

After a few hours of research on PDF to EPUB, I found Calibre is the best open source option. I used it back in the 0.4 or 0.5 days. It worked then to get non-purchased books onto the Kobo. Great for adding Project Gutenberg stuff. Terrible at adding PDF derived works. I gave it another chance though.

It's still fairly crummy at getting PDFs into a usable EPUB. Footers are included. They show up at random spots. Headers are the same. Random pagination from the PDF footers occurs throughout the text. Really distracting. Often those make the text unreadable.

To compound the issue, EPUBs might have just "2" pages: the cover and a really long page that's got all of the PDF content. If this happens, it's impossible to jump to page "40". Instead you have to page forward on the device 40 times.

Turns out that there is a command line tool for adjusting 2 column layouts into normal page layouts. It can also crop the footers and headers off. Since I'm not a huge fan of command line tools involving dimensions in a PDF format, I looked to see if someone fronted it with a GUI.

Turns out that there is such a GUI. Using this and the command line tool made it possible to convert a journal formatted PDF into a readable EPUB. While the output is not perfect (in my experience an occasional last line gets cropped off at random), it is highly readable. The document is easy to load and read. So I'm probably off to the beach since I have to go to a bigger city to roll over my 401k anyway.

How to make the magic happen

  1. Download and install the following tools.
    Calibre - http://calibre-ebook.com/ (I found the website to be ahead of Linux Mint's repo).
    k2pdfopt - http://www.willus.com/k2pdfopt/
    journal2ebook - https://github.com/adasilva/journal2ebook

    You need to add k2pdfopt to your path. This enables journal2ebook to see it.
  2. Take a complex PDF like Amazon's paper on Dynamo (http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)
    and open it with journal2ebook.py.
  3. Crop the PDF and click the "Ready!" button in the lower right. It will ask you where to save the PDF. This step will reflow the two-column layout into a single-column format.
  4. Open Calibre and add the document.
  5. Right click on the PDF and select Convert. In the wizard choose EPUB as the output format. Don't bother with the rest of the settings yet, since most likely you will get no text from the PDF, just a bunch of images. Finalize the conversion.
  6. On the book, right click again, but this time choose "Edit Book". Here you'll get to see inside the EPUB. There should be a file called index*.html in the tree to the left. Open that. Then right click on a <p> and choose "Split at multiple locations". This will open a wizard. There is an option to enter a tag. You want to simply type "p". This tells Calibre to turn every <p> into its own page.
  7. Now open the file "content.opf". Here you want to find the entry for the original index.html file. You should delete both its entry under <manifest> and its entry under <spine>. This will make the really big page disappear from the reader's view (you can also delete the index file).
  8. Finally save the EPUB changes under the File menu.
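
If journal2ebook complains that it can't find k2pdfopt, it helps to check what is actually on your path before going further. The Python sketch below is only an illustration and is not part of any of the tools: the function names are made up for this post, and the only outside facts it leans on are that the executable is called k2pdfopt and that Calibre ships a command line converter called ebook-convert that takes an input file and an output file. It builds the command for step 5 without running it.

```python
import os
import shutil


def missing_tools(tools=("k2pdfopt", "ebook-convert"), path=None):
    """Return the tool names that cannot be found on PATH (or on `path`)."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


def prepend_to_path(directory, path=None):
    """Return a PATH string with `directory` placed in front of the rest."""
    current = os.environ.get("PATH", "") if path is None else path
    return directory if not current else directory + os.pathsep + current


def convert_command(pdf_file, epub_file):
    """Build the command line version of step 5 (PDF to EPUB, default settings)."""
    return ["ebook-convert", pdf_file, epub_file]


if __name__ == "__main__":
    # Report anything that journal2ebook or Calibre would fail to find.
    print("Not on PATH:", missing_tools() or "nothing, all good")
    # Hypothetical file names, shown only to illustrate the command shape.
    print(" ".join(convert_command("dynamo-reflowed.pdf", "dynamo.epub")))
```

To make a fix stick for the current Python session you could set os.environ["PATH"] to the value returned by prepend_to_path, using whatever folder you unpacked k2pdfopt into. A permanent change still has to be made in your shell profile or system settings.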

Non-happy path

You might have a really large document that gets more than one index.html. You will have to eyeball the output then to see exactly what you have to change. But the general idea is the same.
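
When there are several big index files, deleting entries from content.opf by hand gets tedious. Here is a rough Python sketch of what step 7 does, written against the text of content.opf. It assumes the standard OPF namespace and the usual layout where each <item> in <manifest> has an id and an href, and each <itemref> in <spine> points at an id. The function name and the file names are hypothetical, and I have only reasoned about it on a toy manifest, so treat it as a starting point rather than a finished tool.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
# Keep the OPF namespace as the default one when writing the file back out.
ET.register_namespace("", OPF_NS)


def drop_pages(opf_text, hrefs):
    """Remove the manifest items whose href is in `hrefs`, plus their spine entries."""
    root = ET.fromstring(opf_text)
    manifest = root.find(f"{{{OPF_NS}}}manifest")
    spine = root.find(f"{{{OPF_NS}}}spine")

    dropped_ids = set()
    for item in list(manifest):  # copy the list so removal is safe while looping
        if item.get("href") in hrefs:
            dropped_ids.add(item.get("id"))
            manifest.remove(item)

    for itemref in list(spine):
        if itemref.get("idref") in dropped_ids:
            spine.remove(itemref)

    return ET.tostring(root, encoding="unicode")
```

You would call it with the set of oversized files, for example drop_pages(opf_text, {"index_1.html", "index_2.html"}), and write the result back into the EPUB. Real files carry extra metadata that this sketch does not try to preserve exactly, so keep a backup of the original EPUB.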

Enjoy

At this point you should be able to push the file to your reader (Kobo or otherwise). Most of the page should be visible. Occasionally the image might be a touch too large for the reader.

The nice thing is I can read at the beach now and don't have to spend any money. This is great since I'm bootstrapping my own consulting/product firm. Money in the pocket means food on the table. Until next time, thoughtful reading.

PS

One thing to keep in mind is that you can't increase the font size since you're dealing with pictures. You should be able to recreate the EPUB using the steps above, but play with some of the settings.

4 comments:

  1. "You need to add k2pdfopt to your path. This enables journal2ebook to see it."

    Please can you expand on this (both for Windows and Linux cases)? Thanks.

  2. Another thought; it seems that calibre has a pretty nifty API: " calibre is written primarily in Python ... its design is highly modular... The modules interact with each other via well defined interfaces. " All of this would seem to suggest that your steps 4 to 8 could, in fact, be automated via some scripting. I am not volunteering right now, but you seem to be someone who would appreciate automation as a time-saving measure ... WDYT?

  3. I have tried your steps, but get an error:
    " right click on a <p> and choose "Split at multiple locations". This will open a wizard. There is an option to enter a tag. You want to simply type "p"."
    When I type p and click OK, I get this response:
    AbortError: The expression p did not match any nodes
    (a full stack trace can be obtained from the clipboard.)

    Replies
    1. I think you missed the step where you also need to click on the Wizard Button that appears on the pop-up form ...
