UC Berkeley OCR / document structure / math papers

Here are some pointers to our recent papers generally relevant to optical character recognition, but more specifically aimed at reading mathematical expressions from typeset documents. These do not deal with handwritten documents or real-time handwriting recognition.

If you wish to refer to material listed here that is not specifically marked as published, send me a note so I can tell you its current status.

I've been thinking about how to represent scientific documents for use on the World Wide Web. So have a number of other people. I've tried to collect, annotate, and organize some of the efforts to date, in connection with a talk at ICDAR-97 (August 1997, Ulm, Germany) on the topic of More Versatile Scientific Documents. Comments are more than welcome.

If you have only a slow network connection, you might consider deferring the download of the figures in the next paper. It is an HTML description of current work here in interactive zoning and OCR processing.

Benjamin Berman and Richard Fateman: Optical Character Recognition for Typeset Mathematics, which appeared in ISSAC-94.
But you might prefer the longer, more recent, and more polished paper (with Taku Tokuyasu, Benjamin P. Berman, and Nicholas Mitchell), Optical Character Recognition and Parsing of Typeset Mathematics, which appeared in the Journal of Visual Communication and Image Representation, vol. 7, no. 1 (March 1996), pages 2-15. (This is the version as sent to the journal; they re-typeset it by hand.)

You might also see a paper sent to SPIE (San Jose, 1996) by Richard Fateman and Taku Tokuyasu: Progress in recognizing typeset mathematics.

The titles are similar, but the papers represent snapshots in time, and the contents of the most recent one are probably the most useful. In particular, the SPIE paper is the most up to date and the most implementation-related.

A 1997 paper by senior Martin Proulx describes his experiments in parsing mathematics. He found it better to follow center-lines rather than baselines. His results are summarized in A Solution to Mathematics Parsing. Related programs are available on request.

Here is a draft of a paper on how to separate mathematical material from text in a document, How to Find Mathematics on a Scanned Page.

Eventually, a long paper on the details of the Lisp programs will appear. All the programs are intended to be made available to the public, and some are being used by the Institute for Experimental Mathematics (Univ. of Essen, Germany). Unless you are really interested in working in Lisp and have an urgent need for these programs, maybe you should wait.

A paper based on a term project by Eylon Caspi, called Parsing TeX, deals specifically with the question of reading the source text of a TeX document and converting it into a semantically useful representation. The corresponding programs are in Parsetex-public.tar. The paper deals with an old version of a standard table of integrals that was typeset in TeX by hand, and with how to "recognize" it. The newer version has been reconstructed using much nicer, semantically useful macros, which makes the process much easier. In general, the task is impossible. In many particular cases it is possible, but probably not nearly as useful as you might expect: tuning the programs to translate newly encountered idioms may be more costly than just typing the formulas into some suitable computer algebra system (or MathML editor).
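
To give a rough feel for what "converting TeX source into a semantically useful representation" involves, here is a minimal sketch in Python. It is not taken from the Parsing TeX programs; the names tokenize, parse, and ARITY, the choice of nested lists as output, and the tiny set of supported macros are all assumptions made for illustration. It handles only brace groups, superscripts, subscripts, \frac, and \sqrt, and it leaves infix operators such as + uninterpreted inside a "group" node, which hints at why each new idiom needs its own tuning.

```python
import re

# A control sequence, a structural character, or any other single non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|[{}^_]|[^\s{}^_\\]")

# Hypothetical table: how many arguments each known macro takes.
ARITY = {"\\frac": 2, "\\sqrt": 1}


def tokenize(src):
    """Split TeX source into a flat list of tokens."""
    return TOKEN.findall(src)


def parse_atom(tokens, pos):
    """Parse one atom: a brace group, a known macro with its arguments, or one token."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == "{":
        items, pos = parse_sequence(tokens, pos + 1, closing="}")
        # A group holding a single item is just that item.
        return (items[0] if len(items) == 1 else ["group"] + items), pos
    if tok in ARITY:
        args = []
        pos += 1
        for _ in range(ARITY[tok]):
            arg, pos = parse_atom(tokens, pos)
            args.append(arg)
        return [tok[1:]] + args, pos
    return tok, pos + 1


def parse_sequence(tokens, pos=0, closing=None):
    """Parse atoms until the closing token (or end of input at top level)."""
    items = []
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == closing:
            return items, pos + 1
        if tok in ("^", "_") and items:
            # Attach the script to the atom immediately before it.
            script, pos = parse_atom(tokens, pos + 1)
            op = "expt" if tok == "^" else "sub"
            items[-1] = [op, items[-1], script]
            continue
        atom, pos = parse_atom(tokens, pos)
        items.append(atom)
    if closing is not None:
        raise ValueError("unbalanced braces")
    return items, pos


def parse(src):
    """Return a list of prefix-form trees for a TeX math fragment."""
    items, _ = parse_sequence(tokenize(src))
    return items


print(parse(r"\frac{x^2}{1+x}"))
```

Even in this toy version, the denominator comes back as an uninterpreted group of the tokens 1, +, and x. Turning that into a sum, and deciding what a hand-typeset idiom was meant to express, is where the real effort lies.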

Richard J. Fateman, fateman@cs.berkeley.edu