Looks like OCRs are still alive and kicking

Last time I used an OCR was… uh…

Who am I kidding? I can barely remember what I ate for breakfast, let alone something that happened at least 20 years ago.
I remember I had this scanner connected through the parallel port to my PC.
It was a nightmare and it was dishearteningly slow.

But then there was another big can of worms to deal with: at the time, scanning was mostly about documents, and guess what you had to do with those documents?
Yeah! You had to edit them!
And to modify them you had to use one of those things that still give me shivers and remain a recurring nightmare after all these years: OCRs.

For those too young to remember, an OCR (optical character recognition) is a piece of software that “reads” an image (or PDF or what have you), recognizes the characters and spits out an editable document, which might come in several formats such as plain TXT, RTF or Word DOC.
Also, good luck maintaining the original formatting of the text.
And if it was a document made with mixed text and images… yeah, you get the idea.

Yesterday I had to do a word search on a very old PDF manual I got from ye good ol’ internet. Since it was from 1989 it was, unfortunately, one of those “pure scans” where whoever did the job at the time didn’t go through the fuss of straightening the scans and passing them through an OCR.
This meant no word search for me.

It also meant I had to once again use one of those godforsaken programs to try and have the text recognized on this 200+ page PDF manual.

A quick Google search sent me to PDF24, a website with a collection of all kinds of tools for managing PDF files, which is pretty amazing.
I didn’t delve too deep, but I’m pretty sure that the backend is using some kind of open-source software to do all the stuff.

Anyway: it is free to use, doesn’t come with any limitations and it looked “clean”, so I didn’t have much to lose. I just fed its OCR tool the manual and waited.

Good Lord, was it slow.
I think it took something like an hour to do the whole thing, but the final result was mind-blowing: not only did it maintain the original text formatting, but all the images and the text alignment were perfect, and it kept the original fonts and everything.
I had to double-check the result because I was afraid it had done nothing, but almost every single character was recognized, even on the pages that were not scanned straight.

OCR software has come a very long way since the late ’90s. I’m honestly astonished by the results.

I’ll leave both the original manual and the OCR’d one here so you can see for yourself.

10/10 would use again.
