Jul 14, 2023

(+) Understanding Optical Character Recognition

The following is a Plus Edition article written by and copyright by Dick Eastman. 

Do you have a document or even a full-length book that you would like to enter into a computer’s database or word processor? You could re-type the entire thing. If your typing ability is as bad as mine, that will be a very lengthy task. Of course, you could hire a professional typist to do the work, but that is expensive.

We all have computers, so why not use a high-quality scanner? You will also need optical character recognition (OCR) technology.

OCR is the technology long used by libraries and government agencies to make lengthy documents available electronically. As OCR technology has improved, it has been adopted by commercial firms, including ProQuest and other genealogy-related companies.

For many purposes, OCR is the most cost-effective and speedy method available. OCR is much better and cheaper than hiring an army of clerk typists.

OCR is actually the second step in the conversion process. The first step is to scan the document or book in question, much the same as you would scan a photograph. The scanner converts each printed page to a bitmap file, a pattern of dots that makes up an electronic image of the page. Software that comes with the scanner stores the file on the computer’s hard drive in TIFF, JPG, or some other image format.

Next, specialized optical character recognition (OCR) software analyzes the image and converts it to text. Older OCR software would compare the individual letters in a scanned image against stored bitmaps of specific fonts. These pattern-recognition systems worked well only with high-quality scanned images of text that used exactly the same fonts as those expected by the software. In practice, the scanned images rarely matched the stored bitmap images of individual characters exactly, so these systems rarely worked very well. Only a few years ago, OCR had a reputation for inaccuracy.
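To make the older bitmap-comparison approach concrete, here is a minimal sketch in Python. It is only an illustration, not the code of any real OCR product: the tiny 3x5 letter shapes, the TEMPLATES table, and the function names match_score and recognize are all made up for this example.

```python
# Hypothetical 3x5 letter bitmaps ("#" = ink, "." = paper).
# A real program would store much larger bitmaps for every character of a font.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def match_score(glyph, template):
    """Return the fraction of dots on which the scanned glyph and the template agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for glyph_dot, template_dot in zip(glyph_row, template_row):
            total += 1
            if glyph_dot == template_dot:
                same += 1
    return same / total


def recognize(glyph):
    """Return the best-matching character and its match score."""
    best = max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))
    return best, match_score(glyph, TEMPLATES[best])


clean_T = ["###", ".#.", ".#.", ".#.", ".#."]
noisy_T = ["##.", ".#.", ".#.", ".#.", "..."]  # two dots lost to a poor scan

print(recognize(clean_T))
print(recognize(noisy_T))
```

With only three widely different letters, the damaged "T" is still recognized, although its match is no longer perfect. With a full alphabet, several fonts, and smudged pages, many letters end up with nearly identical scores, which is one way to picture why this method earned its reputation for inaccuracy.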

Today’s OCR programs have added multiple neural network algorithms that analyze the stroke edge, that is, the line of discontinuity between the text characters and the background. Allowing for irregularities of printed ink on paper, each algorithm averages the light and dark along the side of a stroke, matches it to known characters, and makes a best guess as to which character it is. The OCR software then averages or polls the results from all the algorithms to obtain a single reading.
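The final averaging or polling step can be pictured with a short Python sketch. This is a simplified illustration under stated assumptions, not the method of any particular OCR package: the function names poll and average_scores, the sample guesses, and the confidence numbers are all hypothetical.

```python
from collections import Counter


def poll(guesses):
    """Majority vote: return the most common guess and its share of the votes."""
    counts = Counter(guesses)
    char, votes = counts.most_common(1)[0]
    return char, votes / len(guesses)


def average_scores(score_tables):
    """Add up each algorithm's confidence per character and return the top character."""
    totals = {}
    for table in score_tables:
        for char, score in table.items():
            totals[char] = totals.get(char, 0.0) + score
    # Dividing by the number of algorithms would not change the winner,
    # so the sums are compared directly.
    return max(totals, key=totals.get)


# Five hypothetical algorithms each make a best guess for one smudged letter.
guesses = ["e", "c", "e", "e", "o"]
print(poll(guesses))

# Three hypothetical algorithms report a confidence for each candidate letter.
score_tables = [
    {"e": 0.7, "c": 0.3},
    {"e": 0.4, "c": 0.6},
    {"e": 0.8, "c": 0.2},
]
print(average_scores(score_tables))
```

Polling counts each algorithm's single best guess, while averaging lets an algorithm that is very sure of itself outweigh one that is nearly undecided. Either way, one algorithm's mistake can be outvoted by the others.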

The remainder of this article is reserved for Plus Edition subscribers only. If you have a Plus Edition subscription, you may read the full article at:*)-Plus-Edition-News-Articles/13228017.

If you are not yet a Plus Edition subscriber, you can learn more about such subscriptions and even upgrade to a Plus Edition subscription immediately at