Appendix: sample texts

page35 is page 35 of ``HAKMEM'' (MIT Artificial Intelligence Lab memo 239), February 1972. HAKMEM is a collection of random programs, data, and problems. Most of this memo, and all of page 35, appears to have been printed with a Courier typeface Selectric (tm) typeball, with a few math and superscript symbols from a Symbol typeball. page35 suffers from having been reproduced by offset printing and has minor defects such as staple holes. It was scanned at 600 dpi and then reduced to 300 dpi so that we could try commercial OCR programs on it. It has a skew of about 0.55 degrees. We used the 300 dpi version, which is 2548 by 3300 pixels. Commercial OCR recognition of this page is fairly successful.
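To give a feel for what a skew of this size means in pixels, the following minimal sketch computes how far a text line drifts vertically across the page width. The function name skew-drift is hypothetical and is not part of the suite; only the page width and skew angle come from the description above.

```lisp
;; Hypothetical helper for illustration; not one of the suite's functions.
(defun skew-drift (width-in-pixels skew-in-degrees)
  "Vertical displacement, in pixels, of a text line across WIDTH-IN-PIXELS
when the page is rotated by SKEW-IN-DEGREES."
  (* width-in-pixels (tan (* skew-in-degrees (/ pi 180)))))

;; page35: 2548 pixels wide at 300 dpi, skew of about 0.55 degrees.
(skew-drift 2548 0.55)   ; between 24 and 25 pixels
```

A drift of this size is roughly one character height on this page, so even a skew of half a degree is enough to move the right end of a text line into the neighboring line's row of pixels.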

6.tif is from a double-page spread (pages 336-337) of a table of integrals by Prudnikov, Brychkov, and Marichev: Integrals and Series, volume 1, printed in 1983 by the USSR government printing office in Moscow on low-quality paper. It was copied once on a Kodak xerographic copier to make it easier for us to scan mechanically. The right half (page 337) is skewed by about 1.5 degrees. We did our experiments on the left side, after cropping out some edge defects. This document was scanned at 600 dpi. It is 3408 by 5604 pixels. Commercial OCR of this page yields no useful information.

form001.tif is from a double-page spread (pages 254-255) of a table of integrals by Gradshteyn and Ryzhik, specifically entries 3.161.3-6 on page 255. Although this was printed by Academic Press, it appears to have been photocopied from the original Russian version. It was copied once on a Kodak xerographic copier to make it easier for us to scan mechanically. The right half (page 255) was deskewed by our software. We did our experiments on this excerpt. This document was scanned at 600 dpi. It is 1831 by 1524 pixels. Commercial OCR of this page yields no useful information.

``Image EMACS''

Given a page of pictures, operations:

  cursor character motions: F B P N (forward, back, previous-line, next-line)
  word motions: F B
  line motions: A E
  delete character: D, rubout
  delete word
  delete line
  learn (characters)
  search
  fill-paragraph
  display in artificial font
  copy
  yank
  insert typed characters?
  OCR ??

More anecdotes

form001.tif, a file produced by scanning pages 254-255 of Gradshteyn at 400 dpi, produced a picture of width 4384 and height 3392. Time to read in: 2.01 CPU + .2 GC. Time to compute connected components: 1.5 CPU + 3 GC; 3332 found. After filtering out spots of area <4, the number of components is 2403.
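The area filter mentioned above can be sketched as follows. This is only an illustration: the component structure and the function filter-out-small are hypothetical stand-ins, since the suite's own component representation is not shown in this appendix, and the sample data is made up.

```lisp
;; Hypothetical representation of a connected component: a bounding box
;; plus AREA, taken here to mean the number of black pixels.
(defstruct component
  (width 0)
  (height 0)
  (area 0))

(defun filter-out-small (min-area components)
  "Return the components whose area is at least MIN-AREA."
  (remove-if (lambda (c) (< (component-area c) min-area)) components))

;; Made-up sample: two specks, one plausible character, one borderline spot.
(defparameter *sample*
  (list (make-component :width 1 :height 1 :area 1)       ; speck, dropped
        (make-component :width 2 :height 1 :area 2)       ; speck, dropped
        (make-component :width 22 :height 23 :area 180)   ; kept
        (make-component :width 2 :height 2 :area 4)))     ; kept: 4 is not < 4

(length (filter-out-small 4 *sample*))   ; => 2
```

Because remove-if returns a new list, the unfiltered components stay available for comparison, which is convenient when experimenting with different thresholds.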

Time to convert and display (bitblt to window) a 1/4 scale version of the 2403 characters: 8.4 CPU + .44 GC seconds.

This double-page spread was copied on a xerographic copier and then run through the scanner. Visually, the left page (254) looks approximately straight, but the right one is skewed. Presumably the right way to deskew is to separate the two pieces, like this. Separating out the left part and the right part:

  (setf left (con-pict pict1 :width 2150))    ; from x = 0 only up to the middle
  (setf right (con-pict pict1 :left 2151))    ; start at the middle
  (setf left1 (dfilter-out-noise 5 left))     ; for example, to clean up some bits
  (setf right1 (dfilter-out-noise 5 right))

Reassemble the pieces together:

  (setf left1t (manypicts21 left1))

The left part appears to be skewed at -0.34 degrees. The right part appears to be skewed at 1.27 degrees.

Characteristics of pictures

If we have identified a pile of connected components, how do we tell what they are? One way to start is to see what the most common size of an enclosing box is. If we find that the (statistical) mode of the box widths is a reliable statistic, we can use it as an estimate of the width of a character. For HAKMEM page35, the mode of the width is 22 and the mode of the height is 23.

For the noisy 6.tif, the mode is 1 bit wide, though after filtering noise out, it becomes 52 bits wide.
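The effect of noise on the mode can be illustrated with a small sketch. The function mode-of and the (width . area) pairs below are hypothetical, made-up data chosen to mimic the 6.tif behavior; they are not measurements from the actual image, and the suite's own statistics code may look quite different.

```lisp
;; Hypothetical helper: most frequent value in a list of integers.
(defun mode-of (numbers)
  "Return the most frequent element of NUMBERS and, as a second value,
its count. Ties go to the value that appears first."
  (let ((counts (make-hash-table))
        (best nil)
        (best-count 0))
    (dolist (n numbers)
      (incf (gethash n counts 0)))
    (dolist (n numbers (values best best-count))
      (let ((c (gethash n counts)))
        (when (> c best-count)
          (setf best n
                best-count c))))))

;; Made-up (width . area) pairs: many 1-bit specks among a few characters.
(defparameter *boxes*
  '((1 . 1) (1 . 2) (52 . 900) (1 . 1) (51 . 870) (52 . 910) (1 . 3)
    (2 . 2) (52 . 880) (1 . 1) (53 . 940) (52 . 905) (1 . 2)))

;; Mode of the widths with the noise left in.
(mode-of (mapcar #'car *boxes*))                        ; => 1, 6

;; Mode of the widths after dropping boxes of area < 4.
(mode-of (mapcar #'car
                 (remove-if (lambda (b) (< (cdr b) 4)) *boxes*)))   ; => 52, 4
```

The second value, the count, gives a rough indication of how much to trust the mode: a mode supported by only a handful of boxes is a weak estimate of character width.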

Comments on re-use

We suspect that research programs are usually not accessible or re-usable outside their own institutions because they are still under development, are insufficiently general, perhaps too slow, or perhaps awaiting a route to profitable technology transfer.

Having developed and refined a set of programs that we feel are efficient yet general enough for others to use, we hope to attract others to look at our source code, refine and augment the facilities, and provide feedback.


Class Account
Fri Dec 1 14:31:16 PST 1995