\documentstyle[mathematica,psfig,preprint]{acmconf}
\title{Parsing of two-dimensional mathematical images}
\author{
Nicholas Mitchell\\
Richard J. Fateman\thanks{This work was supported in part
by NSF Grant number CCR-9214963 and
by NSF Infrastructure Grant number CDA-8722788.
}\\
Computer Science Division, EECS Dep't\\
University of California, Berkeley}
\begin{document}
\maketitle
\psfiginit
\section{Introduction}
Given the locations of glyphs (characters) on a two-dimensional surface,
how do we interpret them as algebraic expressions? More explicitly
perhaps, how can we write a program so that a computer interprets
the locations and identities of the glyphs we see as $3x^2$ as the moral
equivalent of {\tt (times 3 (power x 2))}?
Success in this task provides, among other things:
\begin{enumerate}
\item Technology for parsing of mathematical data such as integral tables,
mathematical or scientific journals, etc.
\item Demonstration of a ``niche'' 2-D parsing technique that
could be a model for other syntax-directed
parsing styles. These might include parsing of stylized tables,
tabular music, chemical formulas, and other tasks.
\end{enumerate}
For our purposes here we assume
that an OCR phase (optical character recognition) feeds us the
correct glyphs and positions from a scanned page \cite{benb}. A somewhat more
realistic evaluation of the current state of the art of OCR suggests
that this lower level will not be perfect, especially in the presence
of noise, and that some efforts, in a complete system, must be
devoted to correcting glyph errors.
\section{Prior work in 2-D Parsing}
Historically, 2-D parsing was addressed by several groups
in the late 1960's \cite{anderson,martin,wells},
as well as by more recent research \cite{chou89}.
The 60's research seems to have been a mixture
of pragmatic work and
rigorous syntactic treatment of the subject.
{Anderson's parser is elegant and handles a very large
domain of mathematical expressions. However, these
two niceties add up to a slow algorithm. At the time
of his work (1969), his recognizer appears to have been tested
to its limits with
equations having about eight symbols.
A much
faster implementation of a subset of the parser by Lewis \cite{lewis}
provided demonstrable evidence of practicality.
Even a modest-size formula, omitting
the citation of references, will tend to have 30 or more characters
(see figure 1).
%$$\int_0^{\pi\over2} {{\sin 2 n x} \over {\sin^{2 n+2} x}} {{dx} \over
%{\exp(\pi \cot x) -1}} = (-1)^{n-1} {{2 n-1}\over{4 (2 n+1)}} \leqno{(1)}$$
%Here we have omitted some of the complexity:
%
%this integral has {\bf fifty-seven} characters.
While it may seem that Anderson's work solves (at least in theory)
a problem rather close to ours,
in fact there is a significant difference.
He used as
input writing from a Graphicon tablet (a stylus/pad device). In his
circumstances, noise is presumably filtered out
by the handwriting-recognition program which insists
that the user re-write an ambiguous or unrecognizable
character.
{While our input is also noisy
(text typeset fifty-four years ago \cite{dehaan}
is sure to be somewhat noisy regardless of how expensive our
scanner is), the character of the noise
is different: the characters are sometimes broken and
sometimes joined. (However, to a first approximation
we too rely on the noise being removed.)}
Although the relative positioning and size of the characters
vary depending upon the human doing the writing, all expressions
in Anderson's setting are relatively uncomplicated.
Our input has considerable page-level structure
(page numbers, section headings); and yet, within a particular corpus
is uniformly typeset with respect to font and spacing.
Some of our expressions
continue for several lines, and in some cases span pages.}
Chou's work \cite{chou89}, by contrast, views the recognition of equations
as a particular domain instance of a general thesis: that recognition
of 2-D images is fundamentally a signal-processing task of decoding
a structural (syntactic) signal corrupted by noise \cite{chou92}.
This is, in our view, a more elegant but perhaps less pragmatic
approach; we hope instead that our own programs
may provide a realistic
level of performance in terms of speed and ability to handle
specific complex texts.
\section{A motivating example}
\begin{figure}
\centerline{\psfig{figure=/home/central/users/fateman/papers/tab1.6.ps}}
\caption{Formula 18.6, page 45, from de Haan.}
\end{figure}
Figure 1 is an expression from a table of definite integrals
\cite{dehaan}.
It is extracted from table 18, which contains selected definite integrals with
limits from $0$ to $\infty$.
The original page was scanned at 400 dots per inch. The
reproduction from PostScript is at
300 dots per inch. Note that the subexpression
$x^p$ is ``one character'' as are the letters ``ng'' in {\em Tang}
(the name for tangent). Note that, contrary to current practice,
the names of trigonometric functions are in italics here.
Although many of these expressions cannot be derived by
computer algebra systems, this one can: however,
the neat formula given is much
smaller than that obtained from (for example)
Mathematica 2.2.
$$ {{\pi \,\left( -{1\over 2} - {q\over {2\,p}} \right) \,
\left( 1 + {q\over {2\,p}} \right) \,\tan ({{\pi \,q}\over {2\,p}})}
\over {2\,p\,\left( -1 - {q\over {2\,p}} \right) \,
\left( {1\over 2} + {q\over {2\,p}} \right) }}.
$$
Interestingly, the computer algebra system's
result, as well as the table's, is incorrect, e.g., when $p=1$, $q=2$.
Macsyma 417.100 appears
to do somewhat better because,
before giving an answer, it asks ``Is $(q+p)/(2 p)$ an integer?'';
Maple V returns the integral unevaluated.
In any case,
our parsed and partly simplified version would look something like this:
\begin{verbatim}
(integral-rule x 0 infinity ;;variable, limits
(times ;; the integrand
(plus (power x q) -1)
(power (plus (power x p) -1)
(times -1
(power x (times -1 p))))
-1)
(power x -1))
(times 1/2 pi (power p -1) ;;result
(tan (times 1/2 q pi (power p -1))))
nil) ;;no conditions
\end{verbatim}
The \TeX\ version
would be something like this
\begin{verbatim}
$$\int_0^\infty {{x^q -1}\over{x^p - x^{-p}}}
{{dx} \over x}
= {\pi \over {2 p}}
\tan {{q \pi} \over {2 p}} $$
\end{verbatim}
and after processing would appear this way:
$$\int_0^\infty {{x^q -1}\over{x^p - x^{-p}}}{{dx} \over x}
= {\pi \over {2 p}} \tan {{q \pi} \over {2 p}} $$
\section{Simplifying our task}
Rather than viewing the equation recognition process as a
parsing problem of a completely arbitrary 2-D
analog of a context-free grammar, it seems to pay (as in the
usual programming language case) to restrict ourselves to a
subset of the context-free languages that is easier to parse,
and yet sufficiently descriptive to handle most, if not all,
of the cases of interest.
Our current program is not based on an automatic parser-generator,
but on a somewhat mechanically generated variant of a ``recursive
descent'' parser, and hence a grammar that maps into such a
parser is particularly desirable.
Conceptually, we may view it as mirroring the way that people
read mathematics: almost uniformly from left to right along
some center-line of an expression, with some deviation up and down.
Without too much loss in generality, we can consider the typical
use of \TeX\ as a generator for typeset equations; more restrictedly
(unrealistically so for our task),
we can consider the subset of \TeX\ that is generated as output from
computer algebra systems. (Macsyma, Maple, and Mathematica can all
generate \TeX\ input as an option.)
\subsection{Center lines}
This uniformity of typesetting is probably the
most important simplifying factor. Looking at the
integral above,
we see that the integral is roughly centered above and below
an imaginary horizontal line drawn through the expression.
The line divides the integral sign in half, then
follows along the two horizontal divide-bars,
splits the equals-sign, and so on. We can either use this center
line as a key, or slip down about a half-character-height to
the ``baseline'' for the expressions to follow. Font description
metrics provide measurements with respect to the baseline. (Characters
with curved bottoms often drop slightly below the baseline;
characters with descenders like $p$ or $j$ drop well below it.)
For our current discussion, let us stick with the centerline, although
for the letter $p$, the geometric center of the bounding box for the
character is quite inappropriate.
A parse proceeding {\bf left to right} along this main
line can very easily output pre-ordered
expressions: notice that we will see
the horizontal divide bars before the
numerator or denominators; likewise we
will see the integral (or summation or product) sign
before the limits or integrand of the integral.
Furthermore, this method works well with
a recursive style of parsing, because
each sub-expression (such as a
numerator) is also centered about a line offset in some direction
from the main center-line.
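The centerline idea can be sketched concretely. Assuming glyphs are represented as (name, x, y, width, height) tuples with (x, y) at the lower-left corner (the input format described later), a hypothetical helper, with names of our own choosing, might select and order the glyphs lying on the main line:

```python
def on_center_line(glyphs, y_center):
    """Return, left to right, the glyphs whose bounding box
    straddles the imaginary horizontal center line y_center.
    Each glyph is a (name, x, y, width, height) tuple with
    (x, y) at the lower-left corner of its bounding box."""
    hits = [g for g in glyphs if g[2] <= y_center <= g[2] + g[4]]
    return sorted(hits, key=lambda g: g[1])
```

Glyphs off the line (exponents, numerators, limits) are then parsed recursively about their own offset centerlines.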
An exception to the left-to-right method occurs in the case of
a multiple-line formula. Here we consider that we have broken the page
into horizontal strips and placed the second and subsequent
lines of an integral formula (these lines do not have the
integral sign on them), to the right of the main line, aligning
the major operator (usually a $+$). Recognizing a multiple-line
formula is somewhat {\em ad hoc}: a line that does not begin
with a formula number and integral sign, but begins with an
operator $+$ or $-$ is a continuation.
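The {\em ad hoc} continuation rule just stated can be sketched as follows; the function name and token representation are our own illustrative choices, not part of the actual program:

```python
def is_continuation(line_tokens):
    """Ad hoc rule from the text: a line continues the previous
    formula if it begins neither with a formula number nor an
    integral sign, but does begin with a + or - operator."""
    if not line_tokens:
        return False
    first = line_tokens[0]
    if first == "integral" or first.isdigit():
        return False
    return first in ("+", "-")
```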
\subsection{Non-ambiguity}
It seems obvious that table entries should be easy to interpret
unambiguously; this is not always the case, however. The mind-set
of the table-compilers may include precedence rules, notations,
abbreviations, semantics for some functions,
and other context that differ from modern
conventions. Furthermore, we have encountered some
nasty ploys to save space on the page, some requiring understanding
of English (or French or Russian) directives. These will probably
not be handled entirely automatically.
Note that disambiguation rules used for one table of integrals may
not serve
for others. De Haan \cite{dehaan}, for example, has a particular
notation that changes the meaning of the solidus ($/$) in exponents
via the ``Notation de Kramp'': $c^{a/b}$ denotes $c (c+b) (c+2 b)\cdots (c+(a-1) b)$, and thus $1^{a/1}$ is our usual $a!$.
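As a check on this reading of the notation, here is a small sketch (the function name {\tt kramp} is ours) that evaluates the rising product and confirms that $1^{a/1}$ reduces to $a!$:

```python
from math import factorial

def kramp(c, a, b):
    """Kramp's notation: c^{a/b} = c (c+b) (c+2b) ... (c+(a-1)b),
    a rising product of a factors with common difference b."""
    result = 1
    for k in range(a):
        result *= c + k * b
    return result

assert kramp(2, 3, 3) == 2 * 5 * 8        # c=2, a=3, b=3
assert kramp(1, 5, 1) == factorial(5)     # 1^{5/1} = 5!
```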
Some kinds of expressions are disambiguated by size, and others
by context: compare $e^{r_y}$ vs.\ $e^r y$. Adjacency can mean function
application, as in $\sin x$, or multiplication, as in $a x$;
adjacency of $dx$ signals ``integration with respect to $x$''.
Spaces are also seen separating
the formula number,
the formula, and the reference citation.
\section{Problems and some solutions}
As discussed
by Anderson \cite{anderson}, the distinguishing feature of
two-dimensional parsing (compared to linear parsing) is the need
to keep track of additional relationships between the terminals
and non-terminals of the language.
Any particular parse may be under several geometric inequality
{\bf constraints} specifying the locations of
bounding boxes and centers for one expression relative to others.
These can be tracked via attributes in a syntax-directed scheme,
or additional components in the terminal and non-terminal objects
handled in a recursive-descent parser.
The problems and solutions can be grouped into those which
are likely to be common over most of the equation-processing
texts, and those which are corpus-specific.
{\subsection{Those with general solutions}
In the expression
$${1+e^x}\over{e^y}$$
notice that the horizontal line breaks the space into
two regions and that these two regions must not
interact. That is, we should not be able to think that
the $y$ has an exponent of $e$, because one lies below the line,
the other lies above the line.
Thus, the parse of the numerator is under the constraint
``the y-coordinate must be greater than some value''.
More generally, a parse may be under $n$ spatial constraints
if it is parsing an expression that is nested $n$ subexpressions
deep.}
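In a recursive-descent setting, such constraints might be carried as a list threaded through the parse; a minimal sketch (the representation and names are our own assumptions):

```python
def satisfies(glyph, constraints):
    """Check a glyph against accumulated spatial constraints.
    A numerator parse carries ('y>', v): the glyph must lie
    entirely above y = v; a denominator carries ('y<', v).
    glyph = (name, x, y, width, height)."""
    _, x, y, w, h = glyph
    for kind, v in constraints:
        if kind == 'y>' and y < v:
            return False
        if kind == 'y<' and y + h > v:
            return False
    return True
```

A parse nested $n$ subexpressions deep simply accumulates $n$ entries in the constraint list.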
{\subsection{Corpus-based considerations}
\subsubsection{Segmenting the page}
We must segregate descriptive text, titles, footnotes, and
line numbers, as well as page decorations such as headers and
footers. Some of these may be sufficiently complicated
as to recommend pre-processing by hand. In our initial tests
we have generally used a ``rubber-band rectangle'' technique
to visually separate out equations from a scanned page. The cropped
rectangle is then processed further.
\subsubsection{Intra-expression spacing}
Consider the simple expression $e^x$.
In a parse starting with the e, we have
to be able to determine how far away to look
for the exponent. Similarly,
consider the expression $e^{2 n+5} 3^x$. While
analyzing the expression $2 n+5$, we have to, at
some point, decide when we are done. If we miscalculate
we might end up with the expression $e^{2 n+5 x} 3$.
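One hypothetical way to decide when an exponent ends is a greedy vertical test: absorb tokens while they remain at the raised level, and stop at the first token that has dropped back to the base's own line. A sketch, with our own names and tie-breaking rule:

```python
def exponent_extent(tokens, base):
    """Collect the exponent of `base` from the left-to-right token
    stream: a token whose bottom edge is closer to the base's bottom
    than to its top has returned to the main line, ending the
    exponent.  Tokens are (name, x, y, width, height) tuples."""
    exponent = []
    for t in tokens:
        if abs(t[2] - base[2]) <= abs(t[2] - (base[2] + base[4])):
            break  # back on the main line, e.g. the 3 in e^{2n+5} 3^x
        exponent.append(t)
    return exponent
```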
\subsubsection{Special fonts and abbreviations}
Some texts may use additional type fonts (Gothic, Black Forest,
an occasional Hebrew letter); some use $l$ for log, sn for $\sin$,
or $Tang$ for $\tan$.
Many of these types of problems have solutions which can
be parameterized on a
book-specific basis. Papers appearing in a particular
journal, having been typeset
historically by the same mechanical setting process, will most likely
be consistent in their spacing conventions for such details as
the relationship of an exponent and
a base. Therefore, it is possible to catalog
all the conventions that are specific to a book;
then this dictionary will be further input to
the parsing algorithm.
Not all such conventions can be predicted; an author is relatively
free to invent new conventions for notation and draw upon additional
character sets.}
\section{Outline of our parsing algorithm}
\subsection{Input}
For purposes of exposition, we assume that a page of glyphs and locations
has been presented to us as a set of 5-tuples. Each 5-tuple
represents (glyph-name, x-origin, y-origin, width, height).
The glyph-name may be a simple character ``1'' or ``integralsign''
or may be a compound object itself including font and pointsize.
The x-y origin data represents the lower left corner of the bounding
box for the character with (0,0) representing the left bottom margin.
The width and height are specified in pixels, at 400 dots per inch
resolution. The ordering of the scan is generally irrelevant. Some
characters may be delivered in pieces; thus an ``='' may be
provided as two short ``hline''s (horizontal lines).
The example from our figure is delivered (approximately) as
\begin{verbatim}
(integral 0 1 51 120)
(x 52 7 30 29)
(HLINE 52 54 207 12)
(x 82 79 27 27)
(p 83 22 19 26)
(q 113 93 18 26)
(HLINE 115 20 52 8)
(HLINE 145 92 52 7)
(x 177 12 27 28)
(HLINE 203 39 29 5)
(1 209 83 20 40)
(p 232 25 23 26)
(d 298 85 30 38)
(HLINE 300 59 62 7)
(x 312 14 29 27)
(x 333 85 27 27)
(HLINE 375 57 54 7) ; bottom of =
(HLINE 375 68 54 7)
(2 446 15 26 40)
(HLINE 448 61 61 6)
(pi 463 86 34 26)
(P 480 1 32 43)
(T 528 54 36 42)
(a 557 54 24 26)
(n 583 54 26 27)
(g 605 41 29 39)
(q 652 73 25 41)
(2 653 17 24 39)
(HLINE 658 62 62 5)
(pi 688 87 32 27)
(p 686 2 32 42))
\end{verbatim}
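This input format can be modeled directly. As an illustration (the type and function names are ours), the sketch below detects the broken ``='' delivered as the two stacked HLINEs at $x=375$ in the listing above:

```python
from collections import namedtuple

# One 5-tuple per glyph: name, lower-left origin, size in pixels.
Glyph = namedtuple("Glyph", "name x y width height")

def looks_like_equals(a, b):
    """Two short horizontal lines with (nearly) the same left edge
    and width, stacked closely, plausibly form a broken '='."""
    return (a.name == "HLINE" == b.name
            and abs(a.x - b.x) <= 2
            and abs(a.width - b.width) <= 2
            and 0 < abs(a.y - b.y) <= a.height + b.height)
```

The two quotient bars of the figure, by contrast, have left edges 30 pixels apart and are rejected by the same test.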
\subsection{Output}
For each equation successfully parsed, we expect a lisp-like
parenthesized expression and a bounding box. The expression
will not in general be entirely simplified algebraically. We expect
substantial further processing will be
necessary to make it suitable for storage in a computer
algebra system as a pattern-replacement pair to be retrieved
on request.
\subsection{Processing}
There are a number of engineering decisions that provide
alternative paths to the end result.
For our processing needs, we can divide the task into linearization
followed by string parsing. (This same tack was adopted by Lewis \cite{lewis}.)
That is we convert
an expression into a linearized string, which then needs further
processing to interpret as an algebraic expression. Such a linearization
converts $\int x^2 \sin ^2 x d x$ into the ``moral
equivalent'' of \verb| int@(x^(2))@(sin^(2))@x@d@x|
where \verb|@| signals no more than horizontal concatenation. A
more sophisticated interpretation is then necessary to determine
that the second \verb|@| is ``multiplication'' but the first and third are
``function application'' and the fourth is
part of some kind of operator. This results in the ``moral equivalent''
of \verb| integral(x^2*(Sin[x])^2,x)|.
\subsubsection{Sorting}
First, ensure that the recognized characters are sorted by x-origin
(minimum) coordinate, then by y. (Our OCR program provides this as a default.)
\subsubsection{Lexical}
Collect adjacent digits and names to form numbers (syntactic
category {\em number}), and symbols
such as sin, cos, tan, log, exp (category {\em name}). We
can distinguish horizontal lines in several categories:
quotient bar, minus sign, top of a square (or $n$th) root, and
part of $=$, $\ge$, or $\le$.
We also collect comma + dot as ``;'' etc. if the OCR phase has not
already done so.
Other items may be single-letter names, if alphabetic;
other symbols such as parentheses, are their own categories.
The bounding boxes for collected objects are adjusted as needed.
For example the {\em Tang} token in our sample input now looks like
{\tt((T a n g) 528 41 106 55)}
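Merging the bounding boxes of collected glyphs is straightforward; the sketch below (function name ours) reproduces the {\tt Tang} token above from the four raw glyphs in the sample input:

```python
def merge_token(glyphs):
    """Merge the glyphs of a multi-character token into one lexeme
    whose bounding box covers all of them; each glyph is a
    (name, x, y, width, height) tuple."""
    names = tuple(g[0] for g in glyphs)
    x0 = min(g[1] for g in glyphs)
    y0 = min(g[2] for g in glyphs)
    x1 = max(g[1] + g[3] for g in glyphs)
    y1 = max(g[2] + g[4] for g in glyphs)
    return (names, x0, y0, x1 - x0, y1 - y0)

# The T, a, n, g glyphs from the sample input merge to the
# ((T a n g) 528 41 106 55) token quoted above.
tang = merge_token([("T", 528, 54, 36, 42), ("a", 557, 54, 24, 26),
                    ("n", 583, 54, 26, 27), ("g", 605, 41, 29, 39)])
assert tang == (("T", "a", "n", "g"), 528, 41, 106, 55)
```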
(Note: how to distinguish a minus sign from a quotient bar? Construct
a square bisected by the horizontal line. If there is a character
contained in the upper half, check to see if it is also a horizontal
line (an $=$ sign), a $>$ (a $\ge$), etc. If none of these situations hold
and there
is also a character
in the lower half, then it is a quotient bar. A square-root bar
is identified by the presence of the rest of the surd mark at its
lower left. If there are no characters above or below, it is a minus sign.)
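The decision procedure in this note can be sketched as a small classifier; {\tt chars\_above} and {\tt chars\_below} stand for the character names found in the upper and lower halves of the bisected square (the representation and names are our own):

```python
def classify_hline(chars_above, chars_below):
    """Classify a lone horizontal line by what surrounds it,
    following the order of tests given in the text."""
    if "HLINE" in chars_above:
        return "equals"            # a second stacked hline
    if ">" in chars_above:
        return "geq"
    if "surd" in chars_below:      # rest of the root mark, lower left
        return "sqrt-bar"
    if chars_above and chars_below:
        return "quotient-bar"
    if not chars_above and not chars_below:
        return "minus"
    return "unknown"
```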
\subsubsection{Segmentation}
This phase separates equations from each other and from the page decorations.
We can use local clues such as recognizing that
each equation has an integral sign on the left, at least one ``='' and
possibly a continuation line or lines. For
continuation lines, we must ``attach'' the continuation
to the right side of the previous line.
\subsubsection{Heuristics}
Optional: match parentheses, square brackets, braces, by size and position.
Raise objections to unusual symbols or unmatched pairs.
\subsubsection{Left-to-Right scan}
Collect, in a left-to-right scan, say $x^2+{a_1}^2 + {r \over s}$, the form:
\verb|x^(2)@+@(a_1)^(2)@+@ (r)/(s)|.
Heuristics for what constitutes adjacency and what constitutes
a superscript may be based in part on baseline characteristics.
Notice that in the form $x^p$ the exponent is nearly adjacent to the base,
while the form $p^x$ provides a much larger disparity. This disparity
is negligible if the base-lines of the characters
rather than their geometric centers are used.
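A sketch of such a baseline-based test, exercised on glyph coordinates from the sample input (the name {\tt is\_superscript} and the 0.4 threshold are our own assumptions):

```python
def is_superscript(base, script):
    """Treat `script` as a superscript of `base` when its bottom
    edge sits noticeably above the bottom edge of `base`.  The
    bounding-box bottom only approximates the baseline (descenders
    such as p drop below it), but it suffices for a sketch.
    Glyphs are (name, x, y, width, height) tuples."""
    return script[2] > base[2] + 0.4 * base[4]
```

For the $x^q$ in the sample input this fires, while two glyphs sharing a baseline (the T and a of {\em Tang}) do not.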
\section{Performance}
The current process of scanning an 8.5 by 11 inch
page at 400 DPI to produce a TIFF file on a 50~MHz 486-based
PC takes about 20 seconds using Caere OmniPage
software. Converting the resulting TIFF file to a PBM format
file and then reading that into a Lisp system takes about 40 seconds.
We do not have timing figures on the more accurate
character recognizer now under development
\cite{benb}, but a more primitive one used for developing the test case
above can identify about 4--6 characters per second (running on a Sparc-1+
workstation).
We do not yet have figures for parsing speed on a full-fledged
grammar.
\section{Conclusions}
It is too early to judge how
well this parser, or one based on more
general syntax-driven principles, will fare
in the face of noisy input and highly variable expressions.
In particular, for dealing with noise, it may be
necessary to consider Chou's techniques \cite{chou92} as
a fall-back position based on
stochastic parsing and probabilistic recognition
of equations.
Further testing will, we hope, demonstrate a substantial level of success
with these simpler techniques, making the fall-back position
generally unnecessary.
\section{Acknowledgments}
We thank D. Halpern and D. Arnon for useful discussions.
\begin{thebibliography}{99}
\bibitem{anderson} R.~H.~Anderson,
{\em Syntax-Directed Recognition of Hand-Printed
Two-Dimensional Mathematics},
Ph.D. dissertation, Harvard Univ., Cambridge, MA, Jan. 1968. Also
(shorter version) in M.~Klerer and J.~Reinfelds (eds).
{\em Interactive Systems for Experimental Applied Mathematics},
Acad. Press, 1968. 436--459.
\bibitem{benb}
B.~P.~Berman and R.~J.~Fateman,
``Optical Character Recognition for Typeset Mathematics,''
submitted to ISSAC-94.
\bibitem{dehaan} D.~Bierens de Haan,
{\em Nouvelles Tables d'Int\'egrales D\'efinies},
edition of 1867, corrected.
G.~E.~Stechert, NY, 1939.
\bibitem{chou89} P. Chou, ``Recognition of equations using a
two-dimensional stochastic context-free grammar'',
{\em SPIE Conf. on Visual Communications and Image Processing},
Philadelphia, PA, Nov. 1989.
\bibitem{chou92} P. Chou and G. Kopec, ``A stochastic attribute grammar
model of document production and its use in document recognition. ''
{\em First Intl. Workshop on Principles of Document Processing},
Washington, DC, Oct. 21--23, 1992.
\bibitem{GR} I.~S.~Gradshteyn and I.~M.~Ryzhik,
{\em Table of Integrals, Series, and Products},
Academic Press, Inc. 1980.
\bibitem{lewis}
H.~R.~Lewis.
``Two Applications of Hand-Printed Two-Dimensional Computer Input,''
Senior Honors thesis,
Committee on Applied Mathematics, Harvard College. (also, Proj. TACT Rpt. 2)
77p. 1968.
\bibitem{martin} W.~A.~Martin, {\em Computer Input/Output
of Mathematical Expressions}, Ph.D. dissertation,
M.I.T. EECS Dep't, 1967.
\bibitem{wells}
Mark B.~Wells and James B.~Morris (eds).
{\em Proceedings of a Symposium on Two-Dimensional
Man-Machine Communication},
SIGPLAN Notices 7, No.~10, October, 1972 (ACM).
\end{thebibliography}
\end{document}