Show HN: Adventures in OCR

blog.medusis.com

120 points · bambax · 3 days ago

Hello HN!

In a recent "Ask HN: What are you working on?" thread, I mentioned I was working on OCRing a large book:

https://news.ycombinator.com/item?id=41971614

The post generated some interest so I thought I would keep HN posted.

The book is Saint-Simon’s Memoirs -- an invaluable historical account of the French court under Louis XIV, full of wit and sharp observations, and of incredible literary value. I'm OCRing the reference edition, published between 1879 and 1930, which contains extensive commentary and footnotes: 45 volumes, ~27,000 pages.

Here's a link to a blog post that describes the techniques used so far (the project is still ongoing):

https://blog.medusis.com/38_Adventures+in+OCR.html

But you may also directly access the result here:

https://divers.medusis.net/boislisle/pub

This web app (not optimized for mobile, sorry) solves a tricky problem: preloading images efficiently. In short, preloading the next image isn't enough, because browsers will repaint if an image is moved or scaled, and they won't paint at all while visibility is hidden or opacity is zero; they paint only when those values change. On an average, slow machine this takes visible time. But if an image is simply sitting behind another element, it does get painted, and removing the covering element or changing the z-index will not trigger a repaint.
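
A minimal sketch of the trick, assuming a "viewer" container and made-up function names (not the app's actual code):

    // The next scan is appended *behind* the current one, so the browser
    // paints it while it is still invisible to the user; revealing it is
    // then just a z-index swap, which doesn't force a fresh paint.
    const viewer = document.getElementById("viewer") as HTMLElement; // hypothetical, position: relative

    function preloadBehind(src: string): HTMLImageElement {
      const img = new Image();
      img.src = src;
      img.style.position = "absolute";
      img.style.top = "0";
      img.style.left = "0";
      img.style.zIndex = "0";     // stacked under the current image (z-index 1)
      viewer.appendChild(img);    // attached but covered => painted now, not later
      return img;
    }

    function reveal(next: HTMLImageElement, current: HTMLImageElement): void {
      next.style.zIndex = "2";    // already painted, so the swap is instant
      current.remove();           // drop the old image once the new one is on top
    }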

(Preloading is important because it lets one review results fast; if one has to wait 150-200 ms between images it's simply discouraging).

Would love to hear feedback; happy to answer any questions!


45 comments
pronoiac · 3 days ago
Oh wow! I've worked on turning PAIP (Paradigms of Artificial Intelligence Programming) from a book into a bunch of Markdown files, but that's "only" about a thousand pages, compared to the roughly 27,000 pages across all those volumes. I have advice, possibly helpful, possibly not.

Getting higher quality scans could save you some headaches. Check the Internet Archive. Or, get library copies, and the right camera setup.

Scantailor might help; it lets you semi-automate a chunk of things, with interactive adjustments. I don't know how its deskewing would compare to ImageMagick's. It might also filter out the signature marks.

I wrote out some of my process for handling scans here - https://github.com/norvig/paip-lisp/releases/tag/v1.2 . I maybe should blog about it.

If you get to the point of collaborative proofreading, I highly recommend Semantic Linefeeds - each sentence gets its own line. https://rhodesmill.org/brandon/2012/one-sentence-per-line/ I got there by:

* giving each paragraph its own line

* then, a linefeed at sentence-ending punctuation, maybe handling quotation marks and parentheses too? It's been a while (a rough sketch follows below)
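
Roughly, in code (an illustrative sketch, not my original script; the sentence regex is a simplification):

    // Step 1: collapse each paragraph onto a single line.
    // Step 2: break after sentence-ending punctuation, keeping closing
    //         quotes/parentheses attached to the sentence they end.
    function semanticLinefeeds(text: string): string {
      return text
        .split(/\n{2,}/)                                      // paragraphs
        .map((p) => p.replace(/\s*\n\s*/g, " ").trim())       // one line per paragraph
        .map((p) => p.replace(/([.!?]["')\]]*)\s+/g, "$1\n")) // one line per sentence
        .join("\n\n");
    }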

ksampath02 · 3 days ago
You could try Aryn DocParse, which segments your documents first before running OCR: https://www.aryn.ai/ (full disclosure: I work there).

eigenvalue · 2 days ago
Out of curiosity, I tried submitting the first 200 pages of the PDF he used to my new tool, fixmydocuments.com, which I also submitted to Show HN today [0]; it generated the following without any interaction beyond uploading the PDF file:

https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...

I think it's not a bad result, and any minor imperfections could be revised easily in the markdown. My feature to turn the document into presentation slides got a bit confused because of the French language, so some slides ended up getting translated into English. But again, it wouldn't be hard to revise the slide contents using ChatGPT or Claude to make them all either French or English:

https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...

[0] https://news.ycombinator.com/item?id=42453651

bondeau · 1 day ago
I’ve used Surya (https://github.com/VikParuchuri/surya) before. It is very good (on par with Google Vision, potentially better layout analysis), but yours is a challenging use case. I wonder if it would be useful.
lassenordahl · 3 days ago
OCR to original structure is a really fun problem! I did something similar during an internship for newspapers, before LLM vision models, and it ended up being a bunch of interval problems: re-aligning and formatting the extracted text. I found Azure's OCR model had the most accurate bounding boxes, which helped a lot.
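
For illustration, the interval part looks roughly like this (made-up box type, not the original code): group OCR word boxes into lines by vertical overlap, then order each line left to right.

    interface Box { text: string; x: number; y: number; w: number; h: number; }

    function groupIntoLines(boxes: Box[]): Box[][] {
      const sorted = [...boxes].sort((a, b) => a.y - b.y);
      const lines: Box[][] = [];
      for (const box of sorted) {
        const line = lines[lines.length - 1];
        // Same line if the vertical interval [y, y+h] overlaps the previously
        // placed box by more than half this box's height.
        if (line && overlap(line[line.length - 1], box) > 0.5 * box.h) {
          line.push(box);
        } else {
          lines.push([box]);
        }
      }
      return lines.map((line) => line.sort((a, b) => a.x - b.x)); // left to right
    }

    function overlap(a: Box, b: Box): number {
      return Math.min(a.y + a.h, b.y + b.h) - Math.max(a.y, b.y);
    }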

Funny how vision models would now almost be able to one-shot it, modulo some hallucination issues. Some of the research back then (~2020) was starting to use vision models for layout generation.
