Show HN: Adventures in OCR
120 points · bambax
In a recent "Ask HN: What are you working on?" thread, I mentioned I was working on OCRing a large book:
https://news.ycombinator.com/item?id=41971614
The post generated some interest so I thought I would keep HN posted.
The book is Saint-Simon’s Memoirs -- an invaluable historical account of the French court under Louis XIV, full of wit and sharp observations, and of incredible literary value. I'm OCRing the reference edition, published between 1879 and 1930, which contains a lot of comments and footnotes: 45 volumes, ~27,000 pages.
Here's a link to a blog post that describes the techniques used so far (the project is still ongoing):
https://blog.medusis.com/38_Adventures+in+OCR.html
But you may also directly access the result here:
https://divers.medusis.net/boislisle/pub
This web app (not optimized for mobile, sorry) solves a tricky problem: preloading images efficiently. In short, preloading the next image isn't enough, because browsers repaint an image when it is moved or scaled. Nor do browsers paint an image at all while its visibility is hidden or its opacity is zero; they paint it only when those values change. On an average or slow machine, this takes visible time. But an image that is simply behind another element does get painted, and removing the covering element or changing the z-index does not trigger a repaint.
(Preloading is important because it lets one review results fast; if one has to wait 150-200 ms between images it's simply discouraging).
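The layering idea described above could be sketched roughly as follows. This is not the app's actual code: the `StackedImage` type and the `zIndexFor` and `restack` helpers are hypothetical names introduced for illustration, and the sketch assumes a small window of images that all sit at the same position. Every image stays visible and fully opaque; only the stacking order changes.

```typescript
// Minimal structural type (an assumption for this sketch) so the code does
// not depend on DOM globals; a real HTMLImageElement satisfies it.
interface StackedImage {
  src: string;
  style: { position: string; top: string; left: string; zIndex: string };
}

// Hypothetical helper: z-index of the image at `index` when `current` is shown.
// The current image gets the highest value; upcoming images sit right behind
// it in reading order, and images already seen wrap around to the bottom.
function zIndexFor(index: number, current: number, total: number): number {
  const distance = (index - current + total) % total; // 0 for the current image
  return total - distance;
}

// Stack all images at the same spot and reorder them by z-index only.
// Nothing is hidden, made transparent, moved, or scaled when `current`
// changes, which is the condition the post says avoids a visible repaint.
function restack(images: StackedImage[], current: number): void {
  images.forEach((image, i) => {
    image.style.position = "absolute";
    image.style.top = "0";
    image.style.left = "0";
    image.style.zIndex = String(zIndexFor(i, current, images.length));
  });
}
```

In a browser, one would presumably create the `<img>` elements for the next few pages up front, call `restack(images, 0)`, and then call `restack(images, n)` on each keypress. Whether this fully avoids repaints on a given browser is the post's claim and is not verified here.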
Would love to hear feedback; happy to answer any question!
pronoiac ·3 days ago
Getting higher quality scans could save you some headaches. Check the Internet Archive. Or, get library copies, and the right camera setup.
Scantailor might help; it lets you semi-automate a chunk of things, with interactive adjustments. I don't know how its deskewing would compare to ImageMagick. The signature marks might be filtered out here.
I wrote out some of my process for handling scans here: https://github.com/norvig/paip-lisp/releases/tag/v1.2 . Maybe I should blog about it.
If you get to the point of collaborative proofreading, I highly recommend Semantic Linefeeds - each sentence gets its own line. https://rhodesmill.org/brandon/2012/one-sentence-per-line/ I got there by:
* giving each paragraph its own line
* then adding a linefeed at punctuation, maybe including quotation marks and parentheses? It's been a while
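The two steps above might look something like this. It is a minimal sketch, not the commenter's actual process: the function name `semanticLinefeeds` is made up, paragraphs are assumed to be separated by blank lines, and the rule for closing quotation marks and parentheses is a guess at what "maybe with quotation marks and parentheses" meant.

```typescript
// Hypothetical helper: reflow text so that each sentence gets its own line.
function semanticLinefeeds(text: string): string {
  return text
    // Assumption: paragraphs are separated by one or more blank lines.
    .split(/\n\s*\n/)
    // Step 1: give each paragraph its own line by joining its wrapped lines.
    .map((p) => p.replace(/\s*\n\s*/g, " ").trim())
    .filter((p) => p.length > 0)
    // Step 2: break after sentence-ending punctuation, keeping any closing
    // quotation marks, parentheses, or brackets attached to the sentence.
    .map((p) => p.replace(/([.!?]["'”’)\]]*)\s+/g, "$1\n"))
    // Keep a blank line between paragraphs.
    .join("\n\n");
}
```

A naive rule like this also breaks after abbreviations such as "M." or "etc.", so a real text would likely need an exception list or a manual pass afterwards.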
ksampath02 ·3 days ago
eigenvalue ·2 days ago
https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...
I think it's not a bad result, and any minor imperfections could be revised easily in the markdown. My feature to turn the document into presentation slides got a bit confused because of the French language, so some slides ended up getting translated into English. But again, it wouldn't be hard to revise the slide contents using ChatGPT or Claude to make them all either French or English:
https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...
[0] https://news.ycombinator.com/item?id=42453651
bondeau ·1 day ago
lassenordahl ·3 days ago
Funny how vision models would almost be able to one-shot it, modulo some hallucination issues. Some of the research back then, around 2020, was starting to use vision models for layout generation.