We provide links to documents related to electronic representation of scientific documents, and specifically those that contain mathematics, in a form that is compatible with further use.

This is a supplement to a talk presented at ICDAR 97 (Ulm, Germany).

- Publishers of journals, texts, reference works, and organizations that must provide archival storage and retrieval. Examples: American Physical Society, Academic Press, JSTOR.
- Software publishers for OCR/ document analysis, document formatting. Examples: Xerox, Caere, Microsoft.
- Software publishers whose products access ``contents semantics'' from documents, including library keyword search programs, natural language search programs, data-base systems, visual presentation systems, mathematical computation systems, etc. Examples: INSPEC, hypertext systems, Maple, Mathematica, Matlab, MathCAD.
- Institutions maintaining access to electronic libraries, which must be broadly construed to include data and programs of all sorts. Examples: Universities, Computer centers.
- Individuals and their programs acting as their agents who need to use these libraries, to identify, locate, and retrieve relevant documents.

We need to have a convergence in design and standards for encoding new or pre-existing (typically paper-based) documents in order to meet the needs of all these groups most efficiently. Various efforts, some loosely coordinated, but just as often competing, are trying to set standards and build tools. We approach this by first dividing the task into visual, syntactic, and semantic components. These specifications can focus attention on addressing the requirements of the different groups. Additionally, these components rely on a structure for documents that incorporates existing library models and extends them to new modes of operation. Berkeley's MVD is one plausible structure which will allow prototype development and explorations.

Here are some references as promised in the ICDAR paper.

First we draw your attention to an interesting position paper regarding the role of libraries and publishers in the dissemination of scientific knowledge in the future: Electronic Publishing in Science: Winners and Losers in the Global Research Village, in which P. Ginsparg of Los Alamos discusses the past, present and future roles of electronic documents in the sciences. Reproducing paper-like documents electronically is a small part of the picture. In fact, the whole production and distribution mechanism must be re-examined. Other papers in the proceedings of this 1996 ICSU/UNESCO conference may also be of interest. A recent survey of the new distribution mechanisms for preprints and associated technology is available in a May 18, 1998 report in Chemical and Engineering News, Electronic Publications.

Since our ICDAR paper concentrated on mathematics representation, we will, for the rest of this report, concentrate on mathematics.

If you look up only one place on the internet, make it the SIMLAB project at Cornell, whose approach we think is fundamentally right: it uses a semantic and structural approach for mathematics, designed for collaborative engineering and design work. The treatment of the context and overall viewpoint of communication known as Mathbus is described in this paper on collaborative mathematics environments. However, no one has a ``lock'' on the exactly right approach, and this one is not so different from the other contemporary projects that they would be incompatible.

Next, a diversion. If you buy into the rigor argument (namely that we can only communicate that which we formally represent -- and this is not a bad argument when it comes to mathematics), there are many WWW resources to look at before we can communicate. For example, on formal representations of many concepts you could check over the listings being assembled by R.B. Jones including this report on mathematics.

Or perhaps you should start with QED, an "international project to build a computer system that effectively represents much of important mathematical knowledge and technique. The QED system will conform to the highest standards of mathematical rigor, including the use of strict formality in the internal representation of knowledge and the use of mechanical methods to check proofs of the correctness of all entries in the system. A principal application of the QED system will be the verification of computer programs. For background on the idea of the QED Project see The QED Manifesto."

If we deliberately ignore this noble line of work and instead work from the (in general false) hypothesis that the semantics for a piece of mathematics is exactly its appearance, we can make some immediate short-term progress. Just take pictures of all the pages and ignore the content. Perhaps with indexing by subject or title (done by hand, perhaps), we can find what we want. It appears that the math book collection at Cornell does approximately this. It appears to be slow to access and download, and to lack content-based search. (It has been unavailable since 1/26/97, and I have not tried it myself). On the positive side, if the images are good enough, anyone with a cleverer idea for analysis or indexing can use this as a basis for further work.

If we rise above this level of bitmap, but view the QED approach as too difficult, what can we do? Historically, an early paper set down what had by that time become fairly obvious, namely that with this hypothesis, *to communicate mathematics electronically in some standard way that could be understood by all,* we just need to agree on a single typesetting system. There are already a number of typesetting systems for mathematics (among academics, TeX and EQN were prominent). Pick one. This was conveyed in a 1987 memo from the Network Working Group (presumably ARPA funded in some way), Issues in Defining an Equations Representation Standard by Alan Katz of USC/ISI. Katz comes up with a recommendation for an EQN ASCII "directive" system which allows simple inclusion of math-display objects in email etc. Actually, I am mischaracterizing Katz: in fact he argues that semantics would *not be represented because the person typing the math may not even know what it means.* This situation may have changed somewhat since 1987, with more scientists typesetting their own papers!

Another early paper is "Towards an Equation Exchange Standard" Task Group Agenda by Tom Libert, Michigan EXPRES Project (libert@citi.umich.edu (1987)). Libert proposes the task force address the question of making a ``content architecture'' versus just equation formatting, and should the group ``develop such an architecture, or simply wait for ANSI/ISO to do it in 2-3 years?'' Clearly the wait has not achieved the architecture, although arguably several other groups have tried to pick up the ball.

A subsequent public meeting at a SIGGRAPH conference concerning this topic of representation of equations resulted in a report by Dennis Arnon of Xerox, Report of the Workshop on Environments for Computational Mathematics. The key idea being proposed here is that an ``abstract syntax'' be defined for any mathematical expression, roughly corresponding to a functional notation (or equivalent). Using this, an SMS or Standard Mathematical System would consist of a collection of SMCs or Standard Mathematical Components communicating via the abstract syntax, run over a ``wire'' between them. Typical SMCs could be assembled for use in one or more of four ways: Editing, Displaying, Computation, and Documentation.

This proposal is a useful clarification, but does not go far enough in elucidating how separate meanings could be attributed to the same abstract syntax in different contexts.
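To make the idea of an abstract syntax shared by several components concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the tuple encoding, the operator names `plus` and `times`, and the functions `display` and `compute` are hypothetical stand-ins, not anything taken from the workshop report. It also illustrates the gap noted above: the same tree yields different values depending on the context in which it is interpreted.

```python
# Hypothetical abstract syntax: (operator, arg1, arg2, ...); leaves are
# integers or symbol names. This stands in for "functional notation".
expr = ("plus", ("times", 2, "x"), "y")


def display(e):
    """A toy 'Displaying' component: render the tree as infix text."""
    if not isinstance(e, tuple):
        return str(e)
    op, *args = e
    symbol = {"plus": " + ", "times": "*"}[op]
    return "(" + symbol.join(display(a) for a in args) + ")"


def compute(e, env, ops):
    """A toy 'Computation' component: evaluate the tree in a given context.

    `env` binds symbols to values; `ops` says what each operator means.
    """
    if isinstance(e, tuple):
        op, *args = e
        return ops[op](*(compute(a, env, ops) for a in args))
    if isinstance(e, str):
        return env.get(e, e)
    return e


# Two contexts that attach different meanings to the same abstract syntax.
INTEGER_OPS = {"plus": lambda a, b: a + b, "times": lambda a, b: a * b}
MOD5_OPS = {"plus": lambda a, b: (a + b) % 5, "times": lambda a, b: (a * b) % 5}

env = {"x": 3, "y": 4}
print(display(expr))                     # ((2*x) + y)
print(compute(expr, env, INTEGER_OPS))   # 10
print(compute(expr, env, MOD5_OPS))      # 0
```

Nothing in the tree itself says which of the two contexts is intended; that information has to travel some other way.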

A more or less linear extrapolation from this position, but with the injection of SGML notation, has been incorporated in the stream of drafts of the WWW consortium MathML Mathematical Markup Language, an XML (Extensible Markup Language) application for describing mathematical expression structure and content. The most recent draft (10 July, 1997) is rather large, but in my view is a reasonable document. It is still undergoing revision by a small group of apparently sensible individuals. I believe this WWW consortium activity originated from the realization that it would be good to add visual math markup to HTML. As of the summer of 1997, it appears that the SGML Math Workshop Group, (apparently now password protected) and the activity at WWW Consortium are about the same. (HTML-Math goals seemed to differ from SGML math only in that it is explicitly stated as a goal that it is easy to learn and edit by hand!). One of the participants informs me that this goal has not changed, and that "We were not able to agree on a simple notation by the May [1997] deadline, so we put out what we could agree on. We are now shooting for May, 1998 for a short input notation."

The current draft allows the intermixing of position (presentation) notations and content notations, as well as a "semantics" tag that allows for an apparently arbitrary piece of stuff to be attached to (what amounts to) a node in a tree. For example, an integral with limits may be presented as an integral sign with sub/super-scripts and embedded *dx* somewhere. By contrast, the semantics for the same object may describe lower/upper limits of integration and may explicitly mark the variable of integration "x".
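The contrast can be sketched with two small XML fragments processed from Python. The element names used here (`msubsup`, `mo`, `mi` for presentation; `apply`, `int`, `bvar`, `lowlimit`, `uplimit` for content) are the ones found in later published MathML recommendations; whether the 1997 draft spelled them the same way is an assumption. The integrand `f(x)` and the limits 0 and 1 are made up for the example.

```python
import xml.etree.ElementTree as ET

# Presentation: an integral sign with sub/superscripts, then f(x), then "d x".
PRESENTATION = """<mrow>
  <msubsup><mo>&#x222B;</mo><mn>0</mn><mn>1</mn></msubsup>
  <mi>f</mi><mo>(</mo><mi>x</mi><mo>)</mo>
  <mi>d</mi><mi>x</mi>
</mrow>"""

# Content: the limits and the variable of integration are marked explicitly.
CONTENT = """<apply>
  <int/>
  <bvar><ci>x</ci></bvar>
  <lowlimit><cn>0</cn></lowlimit>
  <uplimit><cn>1</cn></uplimit>
  <apply><ci>f</ci><ci>x</ci></apply>
</apply>"""


def integration_info(content_xml):
    """Read the variable and limits straight out of the content markup."""
    root = ET.fromstring(content_xml)
    return {
        "variable": root.findtext("bvar/ci"),
        "lower": root.findtext("lowlimit/cn"),
        "upper": root.findtext("uplimit/cn"),
    }


def presentation_tokens(presentation_xml):
    """List the visible tokens; nothing says which 'x' is the bound variable."""
    root = ET.fromstring(presentation_xml)
    return [el.text for el in root.iter() if el.text and el.text.strip()]


print(integration_info(CONTENT))
print(presentation_tokens(PRESENTATION))
```

From the content form a program can ask for the variable of integration directly. From the presentation form it gets a flat run of glyphs and would have to guess that the `d` followed by `x` is a differential.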

If one agrees on a compact notation, one might wonder why SGML syntax is still needed: for what it is worth, there are applications that verify and process SGML syntax, and conformance to the verbose version is required. Two vendors of programs mentioned below (WebEQ and TechExplorer) say they will be able to read and write MathML, and more are expected, including EZMath, Amaya, and products from Design Science (MathType), Waterloo Maple, and Wolfram Research (Mathematica). You can find an update on the status of programs that implement some components of MathML.

There is also a nice discussion and a defense of TTH (TeX to HTML) by its author, Ian Hutchinson. This is a program to render TeX into HTML, including mathematics. It does a reasonably good job, and uses HTML itself. I found it worked rather well, at least on a sufficiently up-to-date browser. It works fine on Netscape 4.6 on Windows NT without any downloading of Java (WebEQ does this) or the use of a plugin (TechExplorer requires this), or the replacement of your whole damned browser (which is what Amaya from W3C requires).

The prospect of everyone using Amaya is pretty slim, but it works nicely for math, allowing one to see typeset material using CSSL, and it allows for coordination, moving into and around structures defined in MathML using XML. It totally ignores the representation of semantics, which is intellectually the far more interesting part of the job, and deals solely with the presentation. That is, it takes the part that is solved by TeX and other typesetting specifications, solves it once again in a particularly long-winded and clumsy way, and (at least to my thinking) violates the SGML spirit (namely that the SGML should represent the "meaning" or at least the structure of a document, and allow the viewer to impose a display representation). Here the XML dictates exactly the display, including font sizes, etc.
Given that, Amaya still is kind of interesting in that it offers simultaneously a tree version and a displayed version of an expression, and the data can be selected and edited by pointing or swiping with a mouse in the window associated with either view. This is neat, even though it misses the major MathML point. Presumably it would be exhausting for W3C and its staff and volunteers to try to keep up with the Netscape or Microsoft browsers (and their installed base), as well as Java, which is not included at all. Amaya should be considered a proof of concept so as to insinuate its features into every browser. (Amaya is also an editor that can be used to create web pages.)

Even within the SGML framework, one can consider various levels of verbosity (I am grateful to Neil Soiffer for offering these examples), depending on how much information you encode in the leaves (CDATA) of the SGML tree, and how much you encode in the SGML structure itself. Consider the expression 2x+y, which might be encoded as

[MATH] [MROW] [MROW] [MN] 2 [/MN] [MO] [/MO] [MI] x [/MI] [/MROW] [MO] + [/MO] [MI] y [/MI] [/MROW] [/MATH]

where MI means Math Identifier, MO means Math Operator, and we have used [] to enclose the SGML markers. By contrast, extended MathML with a parser that knows some precedence rules could have a simpler SGML:

[MATH] 2 x + y [/MATH]

The tradeoff is that the first version might allow standard SGML to validate, search, index (or whatever) the math expressions. The second version requires some external parser that knows about precedence (etc), but is easier for a human to comprehend and write, and simpler to transmit (and probably to store). There are intermediate choices as well.
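The "external parser that knows about precedence" need not be large. The following Python sketch expands the compact form into the verbose bracketed form. It is only an illustration under stated assumptions: the grammar (single-letter identifiers, integers, `+ - * /`, and juxtaposition meaning multiplication) and the function names are hypothetical, and an implied multiplication is emitted as an empty `[MO] [/MO]` simply to mirror the verbose example as it is printed above.

```python
import re

# Tokens: an integer, a single-letter identifier, or one of + - * /
TOKEN = re.compile(r"\s*(\d+|[A-Za-z]|[-+*/])")


def tokenize(text):
    text = text.strip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def row(left, op, right):
    """Wrap an operator and its two operands in an MROW."""
    operator = f"[MO] {op} [/MO]" if op else "[MO] [/MO]"
    return f"[MROW] {left} {operator} {right} [/MROW]"


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse_sum(self):
        # sum := product (('+' | '-') product)*   (lowest precedence)
        node = self.parse_product()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = row(node, op, self.parse_product())
        return node

    def parse_product(self):
        # product := atom (('*' | '/')? atom)*   (binds tighter than sum)
        node = self.parse_atom()
        while True:
            token = self.peek()
            if token in ("*", "/"):
                op = self.take()
                node = row(node, op, self.parse_atom())
            elif token is not None and token.isalnum():
                node = row(node, "", self.parse_atom())  # juxtaposition
            else:
                return node

    def parse_atom(self):
        token = self.take()
        if token is None:
            raise ValueError("unexpected end of input")
        if token.isdigit():
            return f"[MN] {token} [/MN]"
        if token.isalpha():
            return f"[MI] {token} [/MI]"
        raise ValueError(f"unexpected token {token!r}")


def to_verbose(compact):
    """Expand e.g. '2 x + y' into the verbose bracketed markup."""
    parser = Parser(tokenize(compact))
    node = parser.parse_sum()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token {parser.peek()!r}")
    return f"[MATH] {node} [/MATH]"


print(to_verbose("2 x + y"))
```

With this in hand, the compact form can be what people type and transmit, while the verbose form is generated on demand for tools that want to validate or index it.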

More chatting up of alternatives to this kind of encoding is provided in the discussion in minutes of a 1996 SGML Math workshop held at the University of Illinois, as part of its digital library project on ``Federating Repositories of Scientific Literature.'' Perhaps all of this will converge with the OpenMath group's work, which I see as following the notion that mathematics must be made formal and explicit, but based not on axiomatic set theory and logic. Rather, it is to be built on some combination of rigorous "classic" modern algebra, some applied analysis, and a strong dose of programming language design. The OpenMath working group is now working on ``Content Dictionaries'' expressed in a version of SGML. You can view them by following links from the home page. The success of this effort is predicated on the cooperation (and funding) of participants interested in a grand programme of automated mathematics. Some preliminary and anticipatory work in relation to OpenMath is the work at the polymath group of Simon Fraser University, who have put together a library and a collection of "Java beans" implementing draft OpenMath communications standards. The North American Open Math Initiative is an effort to produce and promulgate this standard, involving IBM, Maple, and a variety of other organizations including a number of Western Canadian institutions. To the extent that it attempts to provide some rigor, the OpenMath effort is slightly reminiscent of the QED Manifesto. Yet it appears to be far more pragmatic in being driven by the needs of available computer algebra/ plotting/ network display programs, as well as the hope that Java, in some kind of virtual networked world, will provide solutions for portable scientific communication. This and the polymath project are actively growing and their pages are worth browsing, especially if you are willing to run their Java applets.

In any case, there is an overlapping cast of characters for all the groups mentioned in the previous paragraph; most are listed at the previously noted site HTML-Math.

There are computer algebra systems with notebook-like front ends, each of which would presumably be delighted to see its own technology adopted for the representation and computation of mathematics. Even the most ambitious of these today would have to be extended in semantic scope before being used for computing in all the areas that can be (easily) typeset. Here are a few systems with Notebooks: Mathematica, Maple and Mathview, Macsyma, Axiom (which is supposed to link nicely with IBM's TechExplorer), and MuPAD, a research project at Paderborn University. There are also broad interests in computer algebra and display at Project SAFIR at INRIA (France). These and other computer algebra systems or projects are mentioned with links in the SymbolicNet listing of systems. An ambitious project that began as a communication protocol design for distributed scientific computing, the MP or Multi Protocol at Kent State is designed to support efficient communication of mathematical data between scientifically-oriented software packages. MP exchanges data in the form of linearized annotated syntax trees. At a level above the data exchange protocol, dictionaries provide definitions for operators, symbolic constants, and annotations. Binary encodings are used.
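As a rough picture of what "linearized syntax trees" in a binary encoding might look like, here is a small Python sketch. It is emphatically not MP's actual wire format: the tag bytes, the field widths, and the tiny operator dictionary `OPERATORS` are all invented for illustration, and annotations are left out entirely.

```python
import struct

# Hypothetical operator dictionary shared by sender and receiver.
OPERATORS = {"plus": 1, "times": 2}
NAMES = {code: name for name, code in OPERATORS.items()}


def encode(node):
    """Linearize a tree in prefix order into a byte string."""
    if isinstance(node, int):
        return b"I" + struct.pack(">i", node)            # 4-byte signed integer
    if isinstance(node, str):
        raw = node.encode("ascii")
        return b"S" + struct.pack(">B", len(raw)) + raw  # length-prefixed symbol
    op, *children = node
    head = b"O" + struct.pack(">BB", OPERATORS[op], len(children))
    return head + b"".join(encode(child) for child in children)


def decode(data, pos=0):
    """Rebuild the tree; returns (node, position just past the node)."""
    tag = data[pos:pos + 1]
    pos += 1
    if tag == b"I":
        (value,) = struct.unpack_from(">i", data, pos)
        return value, pos + 4
    if tag == b"S":
        length = data[pos]
        return data[pos + 1:pos + 1 + length].decode("ascii"), pos + 1 + length
    if tag == b"O":
        code, count = struct.unpack_from(">BB", data, pos)
        pos += 2
        children = []
        for _ in range(count):
            child, pos = decode(data, pos)
            children.append(child)
        return (NAMES[code], *children), pos
    raise ValueError(f"unknown tag {tag!r}")


tree = ("plus", ("times", 2, "x"), "y")
wire = encode(tree)
decoded, end = decode(wire)
print(len(wire), decoded == tree)
```

The operator names never cross the wire, only their codes, which is why both ends need to agree on the dictionary.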

There are a variety of one-off projects like the MINSE project, with a "polymediator" which is a computer program available to convert, on request, pieces of readable text into urls of stuff. Or the Euromath project funded by the European Mathematical Trust, to ``enhance research support for European mathematicians'' by assisting in communication, apparently resulting in an editing system for mathematical publishing, and a bulletin.

If your primary interest is simply to display math on Web pages, then the Minnesota Geometry center would like you to use their WebEQ software which requires the user of a browser to download programs (UNIX or Windows) but would then process HTML-Math and other, TeX-like stuff. Poliplus software's EqnViewer comes with a java applet that can be used to display pieces of math. Since their demo page hasn't worked reliably for me, I haven't been tempted.

TCI Software Research, purchased by MacKichan, has a selection of word processing programs including Scientific WorkPlace, which is a word processing system designed especially for preparing technical documents, with substantial support for mathematics. Options for this product originally included links built in to the Maple or Mathematica computer algebra systems, although now it appears that only the option of Maple is supported. SWP represents an alternative route to the ``notebook'' paradigm that seems to be a result of convergent evolution in the Maple/Mathematica (etc) battle. In this case the evolution has come from an almost-WYSIWYG almost-TeX-intermediate-form word processor. There are other products described in links from its home page.

A nicely working program, which was first available for Windows but is now being distributed on various UNIX systems (9/1999), is IBM's TechExplorer, which deals with TeX code and must be down-loaded to your computer as a plug-in. Simon Fraser University has interesting Web-based mathematics presentation and computation facilities in their Organic Mathematics Project. This project demonstrates alternatives to the notion of a static, even hyperlinked, textbook document. People interested in a more multi-media approach to math should find this interesting.

How about publishers? Consider Lightbinders, a producer of CD-ROMs for scientific documents, including the new 5th edition of I.S. Gradshteyn and I.M. Ryzhik's Table of Integrals, Series, and Products (Academic Press). Naturally, a form of encoding was needed for this. Peter Goldie, the founder of this company, explains why SGML is better than PDF, and includes numerous useful links to other items. Basically, SGML, on which Lightbinders' technology is based, is not merely appearance based, as is PDF, but structurally based. This argument carries weight with me. Unfortunately, the SGML used for this book encodes, through embedded TeX, only the appearance of the mathematics! The browser/ front end to this book and other items suitably encoded in SGML is Dynatext from EBT (now INSO), which allows either SGML or TeX format and is compatible with ISO TR 9573 part 11 DTD, the AAP DTD, and the ArborText DTD. In fact, all forms are converted to TeX, then DVI and then rendered as raster images. The advantage of starting with SGML is that search tools that have been adapted for other SGML forms might be able to search in SGML-math too. Searching in TeX is not supported.

Among the companies bridging the content-production tools and publishing industries we also include Xyvision, an electronics based publishing company used by some scientific journals, and ArborText, which provides SGML-based editing and publishing software. Another company in the marketplace supplying electronic publishing for the scientific and technical markets is Aztec Corporation. Their concept is to provide platform-independent CD-ROM based programs using their Intelligent Manuscript Architecture combining formatting, search, and problem-solving. A sample (on the Web) handbook of differential equations is on exhibit.

A recent note to the www-math@w3.org mailing list from Z. Fiedorowicz points out that the UMI Dissertation Abstracts Service has started to "translate" mathematics dissertation abstracts from TeX into some very weird markup language on their web pages. Neither Amaya nor Netscape 4.6 seems to work on it.

A response from Nico Poppelier is that this seems to be some AAP math fragment of SGML. Moreover, it appears that at least some such translations are buggy, having been done by hand, perhaps in TeX by the authors, but perhaps not.

After looking through this material, you might wish to peruse Roadkill on the Electronic Highway: the Threat to the Mathematical Literature for an argument that a distinction must be made between the more casual electronic publication and ``real'' publication.

How about joint projects between publishers and non-profit or educational institutions? The JSTOR project is a not-for-profit organization established with funding from the Mellon Foundation that provides a significant collection of on-line journals made available to participants (universities) under a subscription concept that is intended to satisfy for-profit publishers and users of this material. It is intended to be primarily an archival depository, leaving the most recent published material to be accessed through other means. JSTOR uses a variety of proprietary software that can, however, be downloaded to user systems. The primary distribution mechanism provides printable or viewable page images. In some ways a similar operation, HighWire Press, a unit of the Stanford University Libraries, has set up partnerships with primarily non-profit scientific societies. Here the intent is to provide a mechanism for on-line publication of traditional publications, up to the current issue (and into the future, as contents listings are available). They are explicitly concerned with conversion to a networked library of the future. The documents appear to be searchable PDF.

The issue of communicating ``objects'' of any stripe is not unique to mathematics, and is simply a special case of what many people believe computers should routinely support. Indeed the flexibility of computer networking should allow us to connect almost any digital information source to any information sink.

As a non-mathematical example, a human user interacts with a graphical user interface (on a client computer) to get access to a data-base server such as a library catalog system. There is a necessary intermediary function here: the packaging up of a query for delivery, as well as the packaging of the response. Not all sources and sinks have the same data interface, and there is a lot that can go wrong in the middle. For example, some particular server may be busy or out of operation, but maybe there is an alternative. Intermediation is the realm of a new software category of *Middleware tools*. The vendors of middleware tools claim to assist one in packaging up any old thing and delivering it anywhere else. For greatest utility, it has to be done in a machine-independent fashion. Component technology for linking machines can be based on any of a number of standards, and CORBA (Common Object Request Broker Architecture) is one worth mentioning. In fact, if we look at the objectives of the CORBA effort (which also has task forces, such as financial, multi-media, manufacturing, etc) we find material like this:

The Manufacturing Domain Task Force Mission:

- To foster the emergence of cost effective, timely, commercially available and interoperable manufacturing domain software components through CORBA technology
- Recommend technology for adoption that enables the interoperability and modularity of CORBA-based manufacturing domain software components
- Encourage the development and use of CORBA-based manufacturing domain software components, thereby growing the object technology market
- Leverage existing OMG specifications
- Recommend liaison with other appropriate organizations in support of the preceding goals.

If we substitute (for example) OpenMath for CORBA and OMG, and mathematics for manufacturing, we have a similar story.
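The intermediary function in the library-catalog example (package the query, deliver it, fall back to an alternative when a server is out of operation, unpack the response) can be sketched in a few lines of Python. This is a toy and not CORBA: the names `make_server`, `broker`, and `Unavailable`, the JSON packaging, and the two-server setup are all assumptions made for illustration, with plain function calls standing in for the network.

```python
import json


class Unavailable(Exception):
    """Raised by a server that is busy or out of operation."""


def make_server(name, catalog, up=True):
    """Build a toy catalog server: bytes in, bytes out."""
    def handle(request_bytes):
        if not up:
            raise Unavailable(name)
        query = json.loads(request_bytes)
        keyword = query["keyword"].lower()
        hits = [title for title in catalog if keyword in title.lower()]
        return json.dumps({"server": name, "titles": hits}).encode("utf-8")
    return handle


def broker(keyword, servers):
    """Package the query, try each server in turn, unpack the first reply."""
    request = json.dumps({"keyword": keyword}).encode("utf-8")
    for server in servers:
        try:
            return json.loads(server(request))
        except Unavailable:
            continue  # maybe there is an alternative
    raise Unavailable("no server could answer")


catalog = ["Table of Integrals, Series, and Products", "Handbook of Differential Equations"]
primary = make_server("primary", catalog, up=False)
mirror = make_server("mirror", catalog)
print(broker("integral", [primary, mirror]))
```

The client never learns that the primary was down, which is the kind of service the middleware vendors are selling.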

A currently non-CORBA approach to the same problem, which is based on Java, is RMI or Remote Method Invocation, which provides a plethora of facilities intended to make Java-encoded data (presumably including programs) usable and transportable between platforms. Conveying structured data described as elements of a class is intended to be entirely transparent between the sender and the recipient. If that data consists of ``math'' objects, then maybe we have some software technology waiting for our use. This assumes that we all agree on the form of such an object (and that kind of standardization is most of what we are concerned with above). RMI also assumes we are either using Java or adopting a solution that is Java compatible. Furthermore, it makes the reasonable assumption that we are dealing with objects that can be serialized. This means that they can be written out as a serial byte stream which can be read back into another virtual machine running Java as instructions to create the "same" object. (Common Lisp does this free with its print/read programs, for example). If we do not agree on an object description, translating among variants would be needed. If we want to search for translation software, we might find the software vendors providing component exchange technology such as Lotus's InfoBus. This kind of technology -- embedding the translation process between the front and back ends of programs -- seems particularly unappealing compared to abiding by a standard. Nevertheless, in the absence of agreement, this approach of translation is another position of cooperation among diverse programs.
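Serialization in this sense is easy to demonstrate outside Java as well. The sketch below uses Python as a stand-in, so it is an analogy and not RMI: `pickle` plays the role of the serial byte stream, and `repr` with `ast.literal_eval` plays the role of Lisp's print/read pair. The tuple representation of the ``math'' object is an assumption made for the example.

```python
import ast
import pickle

# A hypothetical "math" object: 2x + y as a nested tuple.
expr = ("plus", ("times", 2, "x"), "y")


def send_as_bytes(obj):
    """Write the object out as a serial byte stream."""
    return pickle.dumps(obj)


def receive_from_bytes(stream):
    """Read the stream back and build the 'same' object."""
    return pickle.loads(stream)


def print_form(obj):
    """Lisp-style 'print': a textual form that can be read back."""
    return repr(obj)


def read_form(text):
    """Lisp-style 'read': rebuild the object from its printed form."""
    return ast.literal_eval(text)


received = receive_from_bytes(send_as_bytes(expr))
reread = read_form(print_form(expr))
print(received == expr, reread == expr)
```

Both round trips succeed only because sender and receiver already agree that a math object is a nested tuple of this shape, which is exactly the standardization question raised above.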

The US Patent and Trademark Office (USPTO) among other government agencies has an interest in representing complex documents, and you can look at a Concept of Operations for the Distributed Object Computation Testbed (DOCT) sponsored by DARPA and USPTO. Among the "key DOCT benefits": "Capabilities include the ability to support information in a wide variety of formats (such as ASCII text, VRML, CGM, JPEG, MPEG, WAV, chemical expressions, mathematical equations, and biological sequences)." The work by SAIC in developing recognition of CWU (complex work units) consisting of mathematics is particularized to 300 dpi monochrome, clean (unbroken, non-touching characters), accurate, minimally skewed images. The OCR is trained on about 350 "symbols" where some of these symbols are actually parts of a multiply-connected symbol. The OCR provides alternative recognition results in case of uncertainty. (An aside: after some exploration, SAIC determined that neural network based approaches were unsuitable, and so a typical ad-hoc pattern matching approach was used to recognize symbols.) The results are tree-like, with some recognition of the most common expression types (e.g. quotients). Further information has appeared in various similar memos on-line at SDSC. See for example the June, 1997 research report on Task A2/A3 Automated SGML Tagging. It is not clear if this is the final report or an intermediate progress report.

Chronological Update.

(July, 1999) It appears that many people are hoping that XML will serve as a common encoding for mathematics, along with everything else. OpenMath continues to meet. math.w3.org continues to discuss. We learn that the official reason that UMI doesn't use TeX is that it is not Y2K compliant (huh?).

Here's the current relationship between XML, MathML, and OM (many thanks to David Carlisle of NAG for this):

OM has an abstract model of the tree representing the mathematical object, and then various encodings of that abstract tree. Currently two encodings are supported, a `binary' one and an XML one. (Previously, there were lisp-ish and SGML encodings, but they have been dropped.)

The XML encoding is the most visible and important encoding; it allows OM objects to live inside the ever growing world of XML documents. There is a certain amount of tension as to how many `XML features' one should allow, that is, whether it is enough to ensure that the XML encoding of OM is in fact valid XML (which it is), or whether you should mandate that every OM application be a full XML application, or whether it only needs to understand the subset of XML used in the XML encoding of OpenMath.

Either way, once your mathematics is living in an XML document it is then amenable to being manipulated via APIs such as the W3C DOM (eg from Java). This should, on a good day, provide for moving mathematics between interactive documents living in web browsers, between computer algebra systems, and anything else that cares to play the game.
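Here is a small sketch of what that manipulation can look like, using the DOM implementation in Python's standard library in place of Java. The element names (`OMOBJ`, `OMA`, `OMS`, `OMV`, `OMI`) and the content dictionary name `arith1` follow the later published OpenMath standard and dictionaries; whether the 1999 drafts matched them exactly is an assumption, and the helper functions are hypothetical.

```python
from xml.dom import minidom

# 2x + y in (what is assumed to be) the OpenMath XML encoding.
OM_XML = """<OMOBJ>
  <OMA>
    <OMS cd="arith1" name="plus"/>
    <OMA>
      <OMS cd="arith1" name="times"/>
      <OMI>2</OMI>
      <OMV name="x"/>
    </OMA>
    <OMV name="y"/>
  </OMA>
</OMOBJ>"""


def element_children(node):
    """Child elements only, skipping whitespace text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def to_tree(element):
    """Recover the abstract tree from one of its encodings."""
    tag = element.tagName
    if tag == "OMOBJ":
        return to_tree(element_children(element)[0])
    if tag == "OMA":  # application: first child is the head, rest are arguments
        head, *args = element_children(element)
        return (to_tree(head), *(to_tree(arg) for arg in args))
    if tag == "OMS":  # symbol defined in a content dictionary
        return element.getAttribute("cd") + "." + element.getAttribute("name")
    if tag == "OMV":  # variable
        return element.getAttribute("name")
    if tag == "OMI":  # integer
        return int(element.firstChild.data)
    raise ValueError(f"unhandled element {tag}")


def rename_variable(document, old, new):
    """Edit the object in place through the DOM."""
    for variable in document.getElementsByTagName("OMV"):
        if variable.getAttribute("name") == old:
            variable.setAttribute("name", new)


doc = minidom.parseString(OM_XML)
print(to_tree(doc.documentElement))
rename_variable(doc, "x", "t")
print(to_tree(doc.documentElement))
```

The same tree could equally be recovered from the binary encoding; only `to_tree` would change, which is the point of keeping the abstract model separate from its encodings.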

How do MathML and OpenMath compare? MathML is the other XML application that is in this mathematics area. There has been a lot of work over the last year (1998-99) to make sure that these two languages are aligned in complementary ways, with Content MathML being essentially a shorthand for OM that covers a fixed range of Mathematics (essentially up to first year university, or end of high school, depending on culture) and `presentation' MathML being the son-of-TeX that provides a mechanism by which OpenMath and Content MathML can specify a visual presentation form for the object. This coordination between MathML and OpenMath is helped by the fact that the project memberships overlap.

According to Carlisle, ``OM has been kicking around for a few years now as a potentially good idea, but with no real momentum behind its use. I'd say that XML and DOM style interfaces giving active documents via Java and similar languages are what makes OpenMath _more_ valid today and with a real chance of being accepted by a wider community. While it's all an academic exercise people can argue for hours, or even years on the benefits of one encoding over another, but once systems are in place that you can have your mathematics inline in a web browser and can interact and modify it from Java applets, or cut and paste it into Maple, Mathematica etc, then people won't care so much about the encoding, so long as it works.''

November, 2003. There have been continuing efforts to galvanize a range of organizations to converge on encodings, clear copyrights, ``do the work'' of scanning and correction, create meta-data, and figure out archival repositories. For example, the American Mathematical Society (AMS) published a position paper by John Ewing in March 2002, Twenty Centuries of Mathematics, discussing the situation, and Cornell University has been funded by the NSF to coordinate a project on the topic of Digital Mathematical Libraries.

After 6 years we still do not see much convergence between the advocates of advanced representations of mathematics (current examples being MathML, OpenMath), and what the publishers seem to understand: processing journals to obtain images, but with some efforts for collecting meta-data. (Some parties seem to be ill-equipped yet still eager to fight battles about the superiority of PDF, DjVu, TIFF, or whatever they happen to be doing. The point they may miss is that none of the math content is encoded in any of these forms.)


*mail comments to Richard Fateman, fateman@cs.berkeley.edu*

last revised November, 2003. RJF