One thing about me that surprises people is that I read books, the kind printed on paper made with murdered trees. Why don’t I have a Kindle or some other magical book gadget? I’m not sure about the ecological tradeoffs if you get dead tree books from the library like I do; if 100 people make good use of a paper book, that could be a better deal than 100 people all charging lithium ion batteries to read in other ways. Could be. Hard to say. Another problem I have with reading books on digital devices is that I like to read outside in the daytime. In the overly bright SoCal sun, an actively lit display is simply not actively lit enough. The problem e ink was designed to solve is still a problem I have.
Of course I read books indoors too, maybe twice as often, sometimes in the dark. I know how to configure computers so that the text on them is, for me, optimally readable. But any time I have looked into reading a book on some kind of computer, I am balked by what is, for me, a very suboptimal reading experience. This usually takes the form of some very irritating web browser shenanigans, or worse, that basically drive me crazy. Add to this the turf wars of each content provider trying to stalk your reading habits and monopolize your dollars, plus my lack of dollars, and the whole proposition just doesn’t seem worth it.
Recently I was checking the public library’s website for a real book I had reserved and there was an "ebook" option with a button that said "Read Now". As a bit of a lark I clicked that and after allowing only one mysterious domain to run Javascript in my browser I was looking at the front of the book. That was impressively easy and free of charge. I started reading it and got about 100 pages into it before I had my meltdown. It was ok, but ok is often not good enough for me. I wanted to keep reading, but my way. I wondered, can I turn this thing into a proper text file? Of course there was all kinds of tricksy browser smoke and mirrors to keep you from even selecting passages (super annoying as I was quoting heavily from the text).
About 90 minutes after deciding to see if it could be done, I had a proper text file containing the book’s text. I’m not saying, run amok and do illegal and terrible things with properly copyrighted stuff, but sometimes responsible people can have legitimate uses for this kind of thing. For example I’m definitely going to do some kind of frequency analysis on the text (as I did here). Another possibly legitimate application relates to the book scanner I built. I never got the cameras configured, but I may revisit that project now.
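If frequency analysis sounds mysterious, the classic shell pipeline for a quick word count is all I really mean; something like this, with entire_pp.txt being the final text file produced at the end of this whole process:

tr -cs 'A-Za-z' '\n' < entire_pp.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn | head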
I’m sure there are serious improvements I could make to my strategy but basically here is how I did it.
Auto-indexing - You need a way to automatically advance to the next page in the reader. In the case of the setup I was using, a space while the browser was focused would advance to the next page. What I needed was a programmable way to make my computer think I pressed the space bar when, in fact, I did not. With Linux you have complete authority and control over what your computer does and because simulating a space key press event is not something computer science prohibits, Linux does not either. The way I chose to do this involves the uinput Python module that can interface with the uinput kernel module. Getting both of those things available is the hardest part of this whole operation.
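On a Debian-ish system, something like this is probably the quickest route to getting both pieces in place (python-uinput is the name the Python module goes by on PyPI; I’m not vouching for any particular packaging):

sudo modprobe uinput
pip install --user python-uinput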
Here’s the tiny Python program I use to synthesize a space key press event.
#!/usr/bin/python
'''Don't forget! You must load the kernel module with:
sudo modprobe uinput'''
import uinput
import time
ukeys = list()
ukeys.append(uinput.KEY_SPACE)
device = uinput.Device(ukeys)
time.sleep(1) # Needed or things don't get set up properly.
device.emit_click(uinput.KEY_SPACE)
When this program is run, a one second delay happens and then the kernel is told that the user input system has just pressed space (even though it hasn’t). The one second delay seems necessary to ensure the input system is ready for events from this module. There is probably a way to shorten or eliminate this, but I was being conservative. Another approach would be to look into expect.
Iterate each page - You need to repeat the steps for each page. I used a Bash script with something like this.
for N in {01..547}; do
: ... Do each page's stuff here.
done
Capture the image - If you can see it, you can probably dump a bitmap of it somewhere. I use ImageMagick's import command which is so clever it’s painful. First use xwininfo to find out the window ID of the browser where the material is. Then do something like this.
import -window "0x3200097" +frame ${OUT}/uncropped.png
This produces a PNG file called uncropped.png containing the contents of the specified window.
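In case it isn’t obvious, getting that window ID is just a matter of running xwininfo, clicking on the browser window, and reading the hex ID off the line it prints, which you can isolate like this.

xwininfo | grep 'Window id'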
Next - Run the aforementioned Python script at this point. The idea is that while further processing is transpiring, the (laggy?) browser can be settling down. Note that this needs to be run as root. Also, since I didn’t want to put uinput among my real Python modules, the PYTHONPATH needs to be specified.
sudo \
PYTHONPATH=/home/xed/X/P/prog/py/uinput/lib/python2.7/site-packages \
./1s+spc.py
Crop - Often the specified window will contain all kinds of superfluous junk that should be cut out. It may be possible to combine this step with the import command, but I didn’t check this too closely. Here’s the command that I used to isolate just the targeted text part of the image.
convert -crop 700x740+50+167 ${OUT}/uncropped.png ${PNG}
The geometry option argument means that I want a 700 pixel wide by 740 pixel tall image taken from the parent image starting at 50 pixels over from the left edge and 167 pixels from the top.
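As for combining the steps, import does document a -crop option of its own, so something like this might do both jobs in one shot (untested, as I said):

import -window "0x3200097" -crop 700x740+50+167 +frame ${PNG}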
OCR - Once you have the image of the text captured in a file, you need to convert it to text. This is where the computer needs to actually "read" dots on the screen into semantic text. I used a program called tesseract which I discovered is astonishingly good. Debian/Ubuntu users can just use apt to install tesseract-ocr. Really, just take a look at how hard this is to use.
tesseract ${PNG} ${OUT}/pp.${N}
That’s it. This will produce a text file containing the text found in the image. The second argument will get a .txt appended to it.
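Assembled into the loop from earlier, the per-page steps end up looking roughly like this (OUT is just an illustrative scratch directory; the window ID, geometry, and paths are from my setup and will certainly differ for you):

OUT=/tmp/bookrip ; mkdir -p ${OUT}
for N in {01..547}; do
    PNG=${OUT}/pp.${N}.png
    import -window "0x3200097" +frame ${OUT}/uncropped.png
    sudo PYTHONPATH=/home/xed/X/P/prog/py/uinput/lib/python2.7/site-packages ./1s+spc.py
    convert -crop 700x740+50+167 ${OUT}/uncropped.png ${PNG}
    tesseract ${PNG} ${OUT}/pp.${N}
done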
Concatenate - Just focus the browser and run the script. You should end up with a collection of text files. The point of this exercise can be highlighted by showing how easy it is to concatenate them all into a single file.
cat *.txt > entire_pp.txt
Ligatures - The OCR likes to faithfully transcribe ligatures. This can be fixed with some sed like this.
sed -i 's/ﬂ/fl/g' entire_pp.txt
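ﬂ isn’t the only one; the handful of standard Latin ligatures can all be dispatched in one pass like this.

sed -i 's/ﬁ/fi/g; s/ﬂ/fl/g; s/ﬀ/ff/g; s/ﬃ/ffi/g; s/ﬄ/ffl/g' entire_pp.txt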
Quotes - The same goes for fancy opening and closing quotes and unnatural apostrophes. Same fix.
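Assuming the OCR emitted the typical Unicode curly characters, something like this straightens them out (adjust the character classes to whatever actually shows up in your file):

sed -i "s/[“”]/\"/g; s/[‘’]/'/g" entire_pp.txt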
M-dashes - A trickier problem is that the OCR transcribed many of the long dashes into the letter i. This strange quirk makes a lot of ugly misspelled words. It can probably be fixed by actually learning how the tesseract program really works, but I already know how Unix works, so…
function fixi {
    W=( $(echo $1 | sed 's/i/ i /g') )  # split the word at each "i"
    for p in $(seq 0 ${#W[@]}); do      # try turning each piece into a space
        for P in $(seq 0 ${#W[@]}); do
            if [ "$P" == "$p" ]; then echo -n " "; else echo -n ${W[$P]}; fi
        done; echo
    done | while read N; do             # keep splits that pass the spell check
        if [ ! "$(echo $N | spell)" ]; then
            echo \"s/$1/${N}/g\" | sed 's/ \([^ ]*\)$/--\1/'
        fi
    done
}
spell <entire_pp.txt | grep i | while read N; do fixi $N; done \
    | grep -v sed- | tee dashcorrections
I know, it’s a horrible hacky mess but it does show a proof of concept of what is possible. This produces a list sort of like this.
"s/thinkingithat/thinking--that/g"
"s/themithe/them--the/g"
"s/muchior/much--or/g"
"s/lastithis/last--this/g"
"s/areijust/are--just/g"
"s/youihow/you--how/g"
"s/hellimaybe/hell--maybe/g"
"s/everythingithe/everything--the/g"
"s/seeiwith/see--with/g"
"s/walliwouldn't/wall--wouldn't/g"
"s/stoveicoffee/stove--coffee/g"
"s/Butiwell/But--well/g"
"s/noiGod/no--God/g"
"s/dammitidoesn't/dammit--doesn't/g"
There are some false positives, such as these two, but overall it works well.
"s/antisabotage/ant--sabotage/g"
"s/vestigal/vest--gal/g"
You can use this with Vim or sed or whatever and fix things up. Running sed 510 times on a 564k file (547 book pages) took only 98 seconds on an extremely weak machine.
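Incidentally, once the false positives are weeded out, the whole list can be applied in one pass instead of 510 separate invocations by stripping the quotes and handing it to sed as a script file (dashfixes.sed is just an illustrative name):

tr -d '"' < dashcorrections > dashfixes.sed
sed -i -f dashfixes.sed entire_pp.txt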
I’m told that converting the images to bitonal before sending them to tesseract will improve quality quite a bit, but my results were good enough to not bother.
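If you do want to try that, ImageMagick can do the conversion; a blunt 50% threshold like this (untested in this pipeline), slipped in before the tesseract step, would be the place to start.

convert ${PNG} -colorspace Gray -threshold 50% ${PNG}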
So there you go: not only is it possible to convert an ebook into a text file, it’s not even terribly difficult. Certainly converting a page or two by hand, to spare yourself retyping a passage you want to cite, is quite easy and worth considering.
Update - 2016-11-24
Dr. S points out another potentially easier way to automate the browser. "I usually use selenium for those things but it doesn’t always work." He also reassures me that I’m not the only one who’s ever been driven mad by bad (web) interfaces that require unnatural clicking.