Introduction

As part of an effort to incorporate printed material into a digitally stored library, we have been faced with digitizing potentially enormous quantities of printed paper documents. Assuming we can detect and separately handle half-tone photographs and other non-text material, the remaining text components may still be unconventional. For example, some of our corpus is tabular data, and some includes mathematical notation.

How should scanned text be stored?

On the one hand, the cost of storage (especially tertiary storage) has dropped to the point where it can be economically feasible to base document-handling systems on storage and retrieval of pages as compressed bitmaps. Indeed, some business workflow applications are fundamentally database retrieval systems for such pages, where externally imposed keys are used for indexing.

On the other hand, for many applications, including ours, it is useful and sometimes critical to be able to recognize, at a good level of accuracy, the text contents of the scanned material. Such a recognized document, stored as a bitmap plus a structured ``parsed'' text, can be subjected to indexing, search, reformatting, computation, and economical re-transmission.

Typical commercial OCR programs are usually targeted at business text and simple tables; they tend to emphasize forms recognition, pre-``zoned'' documents, and high volumes of essentially similar documents. The commercial programs, while remarkably successful in certain domains, are commonly packaged in a way that substantially precludes using and refining their tools to gain higher levels of recognition on unconventional material (e.g., mathematics) or, for that matter, on ``very'' conventional material: that is, pages that one can depend upon to be very limited in font usage, although perhaps noisier than would otherwise be comfortable.

Our experiments with several commercial OCR programs on our initial data of scientific (mathematical) text suggested that the available monolithic commercial (proprietary) programs would not perform adequately on our tasks and could not be modified for our purposes. Therefore we embarked, reluctantly at first, on a project to design and implement our own programs. The routines described in this document are the early fruits of this effort.

The modules described here are intended to be portable, re-usable, and reasonably (sometimes extremely) efficient. They are based on straightforward designs, mostly mirroring what has been shown to be effective in the literature. For the most part, we have deviated from simplicity only when the simple solution was tried and found inefficient or inadequate. We expect that further development will follow the same route.


Class Account
Fri Dec 1 14:31:16 PST 1995