match-up strategies



next up previous
Next: About this document Up: A Suite of Lisp Previous: Appendix:sample texts

match-up strategies

One annoying barrier to the collection of characters on a page is that the major heuristic used is guaranteed to fail on some characters, and likely to fail on less-than-perfect images. This major heuristic is that a character has a 1:1 correspondence with a connected component of the ``graph'' of the image. This is fine for a character like S but clearly fails for i, j, = and punctuation including :";!? Additionally, often characters are disconnected by accident; e.g. it is common to see the separation of a leg on an n, or an accidental connection. A very typical one is that a sequence rn is connected to look rather like an m.

How to fix this up?

First, consider a fixed width font such as Courier (Normal) If we draw boundary boxes around the characters' connected components , we see that they are in fact not all the same width, and that I is narrower than M. What is fixed, is the character spacing. That is, the space from the origin of one character to its right neighbor's origin is fixed, and that there is a imaginary rectangle that is the character width and a height that includes enough space for the tallest character as well as the one with the deepest descender (e.g. p, y, or for that matter, parentheses, brackets and other characters.) In some fonts, the characters do not strictly adhere to the rules: some characters actually intrude into neighboring boxes just a little. Italic fonts do this quite routinely.

Strategy 1: To join the dot to the i, and the pieces of broken letters together.

Divide the page up into horizontal ``swaths''. These generally correspond to lines, but in the case of tightly spaced lines, we may find no breaks, and end up with whole paragraphs. Often these swaths can be found accurately by deskewing and looking for large cuts, but our method should not be so sensitive to failure that it matters if we end up with two or more lines in one swath.

Next, Determine heuristically, the size of the characters. For a line typeset in all one size, or mostly one size, the median left-edge to left-edge distance of adjacent characters provides the character width. This need not be exact. The height can be determined by the maximum height of characters, after eliminating some percentage of outliers: merged characters, vertical rules, integral signs. One can also use as a guideline, the character width and height of the previous line(s).

Our intent is to divide each horizontal swath into horizontal regions, each a character width wide, and as high as the swath is high. We scan left to right in the regions, picking out (a) connected components that are so completely clearly a letter that they are easily recognized as such. Remove them. (b) scraps that are not recognized. For these, we take their bounding box, with (x,y) origin at lower left. extend a character-width box to the right, and a 50% character-height box up and down. Collect all pieces that are entirely or partly within this box. (This allows us to connect together a W whose upper-left leg is the first encountered piece separated from the rest of the body, as well as an A whose lower-left leg is separated.) An alternative is to determine the baseline heuristically, and then use the x-coordinate of the scrap plus the presumed character position.

Next, create a minimum bounding box around this new object. Compare it to the heuristically determined character sizes: if this is too large, consider chopping it to the right size. If it passes muster, treat it as a connected component for purposes of removal from the page image, clustering, recognition, etc.

Extension to variable-width fonts: repeat the exercise looking up first the widest possible explanation of the character-width, and then narrowing down so that as many pieces are explained as possible, with the fewest explanations.

This will not work well if character pieces are first happily explained away, e.g. a letter I is recognized and removed; afterward a scrap that looks like H with its right leg missing must be explained.. It should have been joined to the ``I''.

This will not work well for characters whose left-most piece and/or right-most piece just missing. A W with a missing left leg, resembling perhaps an italic N, will not be aligned right. And if both left and right pieces are missing, for example a W which would resemble an A without a cross-bar, we will also going to have problems. I think we're going to miss these.

Strategy 2: Sliding matchups. Very slow, it seems to me..

{further discussion of deskew.. page35 was identified as having a skew of about -0.22 degrees.

If we look for breaks between horizontal swaths in the page 35 as scanned, we find there are 51 swaths: 14 noise, 6 double-line height swaths, and 31 accurate lines.

If we look for breaks in the deskewed version, we find 49 swaths, 10 are 44 high, 15 are 45 high, and all but 6 are between 35 and 47. There are three tall lines which have superscripts, and 6 small swaths that are noise, including the image of the staple holes in the corner. All 43 lines are correctly isolated.

Looking for vertical breaks (between characters) is plausible on this page only because it is set in a fixed-width font. In the deskewed version, we can easily separate about 49 character spaces: each ``column'' is a vertical swatch that is between 24 and 27 pixels, plus a few spaces that are marginal noise, and a few spaces that are multiple characters wide: two, three, or four.

In the as-scanned version, these vertical breaks are obscured. On a page set with a variable-width font we would not expect much success in aligning characters in columns anyway, so we would not make much use of this inter-character spacing, in general. (The vertical break detection would still be useful to separate columns in a 2-column page.)

Appendix: What is CL+TIFF

It is handy to have a Common Lisp with the file-acessing library for TIF (Tagged Image Format) pre-loaded. It is faster to load up and may have addition sharing of code if there are several processes using the software.

Wegif made a new version of Allegro Common Lisp 4.2 linked up with appropriate entry points from the tiff library. The linkage is done by putting together a dummy C language program (we call it dummy.c here) that mentions appropriate entry points. One then builds a new Allegro Common Lisp using its script called config, as defined in its build directory, using this dummy program and the tiff library. (Anyone who has installed a copy of this lisp as distributed will have addressed the issue of this config file already.)

Our script looks like this:

  cc -c dummy.c
  sh config temp=/usr/tmp dummy.o /usr/sww/lib/libtiff.a

What should dummy.c contain?

We ran a perl programgif to construct that file by collecting the symbols in the tiff library, but one can merely look at dummy.c below. Almost all of these entry points are unused in the current project, but we left them in.

The comments repeatedly mention the local library directory sww which stands for SoftWare Warehouse. The rest of the code is nothing more than a list of names.

int
dummyRoutinesToForceLoadOfLibTiff()
{
  TIFFModeCCITTFax3();  /* From /usr/sww/lib/libtiff.a(tif_fax3.o) */
  Fax3Decode2DRow();  /* From /usr/sww/lib/libtiff.a(tif_fax3.o) */
  Fax3PutBits();  /* From /usr/sww/lib/libtiff.a(tif_fax3.o) */
  Fax3PutEOL();  /* From /usr/sww/lib/libtiff.a(tif_fax3.o) */
  Fax3Encode2DRow();  /* From /usr/sww/lib/libtiff.a(tif_fax3.o) */
  TIFFInitCCITTFax3();  /* From /usr/sww/lib/libtiff.a(tif_fax3.o) */
  TIFFInitCCITTFax4();  /* From /usr/sww/lib/libtiff.a(tif_fax4.o) */
  TIFFVGetFieldDefaulted();  /* From /usr/sww/lib/libtiff.a(tif_aux.o) */
  TIFFGetFieldDefaulted();  /* From /usr/sww/lib/libtiff.a(tif_aux.o) */
  TIFFInitCCITTRLE();  /* From /usr/sww/lib/libtiff.a(tif_ccittrle.o) */
  TIFFInitCCITTRLEW();  /* From /usr/sww/lib/libtiff.a(tif_ccittrle.o) */
  TIFFClose();  /* From /usr/sww/lib/libtiff.a(tif_close.o) */
  TIFFNoRowEncode();  /* From /usr/sww/lib/libtiff.a(tif_compress.o) */
  TIFFNoStripEncode();  /* From /usr/sww/lib/libtiff.a(tif_compress.o) */
  TIFFNoTileEncode();  /* From /usr/sww/lib/libtiff.a(tif_compress.o) */
  TIFFNoRowDecode();  /* From /usr/sww/lib/libtiff.a(tif_compress.o) */
  TIFFNoStripDecode();  /* From /usr/sww/lib/libtiff.a(tif_compress.o) */
  TIFFNoTileDecode();  /* From /usr/sww/lib/libtiff.a(tif_compress.o) */
  TIFFSetCompressionScheme();  /* From /usr/sww/lib/libtiff.a(tif_compress.o) *  TIFFSetField();  /* From /usr/sww/lib/libtiff.a(tif_dir.o) */^_/);
  TIFFVSetField();  /* From /usr/sww/lib/libtiff.a(tif_dir.o) */
  TIFFGetField();  /* From /usr/sww/lib/libtiff.a(tif_dir.o) */
  TIFFVGetField();  /* From /usr/sww/lib/libtiff.a(tif_dir.o) */
  TIFFFreeDirectory();  /* From /usr/sww/lib/libtiff.a(tif_dir.o) */
  TIFFDefaultDirectory();  /* From /usr/sww/lib/libtiff.a(tif_dir.o) */
  TIFFSetDirectory();  /* From /usr/sww/lib/libtiff.a(tif_dir.o) */
  TIFFLastDirectory();  /* From /usr/sww/lib/libtiff.a(tif_dir.o) */
  TIFFFindFieldInfo();  /* From /usr/sww/lib/libtiff.a(tif_dirinfo.o) */
  TIFFFieldWithTag();  /* From /usr/sww/lib/libtiff.a(tif_dirinfo.o) */
  TIFFReadDirectory();  /* From /usr/sww/lib/libtiff.a(tif_dirread.o) */
  TIFFWriteDirectory();  /* From /usr/sww/lib/libtiff.a(tif_dirwrite.
  TIFFInitDumpMode();  /* From /usr/sww/lib/libtiff.a(tif_dumpmode.o) */
  TIFFSetErrorHandler();  /* From /usr/sww/lib/libtiff.a(tif_error.o) */
  TIFFError();  /* From /usr/sww/lib/libtiff.a(tif_error.o) */
  TIFFReadRGBAImage();  /* From /usr/sww/lib/libtiff.a(tif_getimage.o) */
  TIFFInitJPEG();  /* From /usr/sww/lib/libtiff.a(tif_jpeg.o) */
  TIFFFlush();  /* From /usr/sww/lib/libtiff.a(tif_flush.o) */
  TIFFFlushData();  /* From /usr/sww/lib/libtiff.a(tif_flush.o) */
  TIFFInitLZW();  /* From /usr/sww/lib/libtiff.a(tif_lzw.o) */
  TIFFInitNeXT();  /* From /usr/sww/lib/libtiff.a(tif_next.o) */
  TIFFClientOpen();  /* From /usr/sww/lib/libtiff.a(tif_open.o) */
  TIFFScanlineSize();  /* From /usr/sww/lib/libtiff.a(tif_open.o) */
  TIFFFileName();  /* From /usr/sww/lib/libtiff.a(tif_open.o) */
  TIFFFileno();  /* From /usr/sww/lib/libtiff.a(tif_open.o) */
  TIFFGetMode();  /* From /usr/sww/lib/libtiff.a(tif_open.o) */
  TIFFIsTiled();  /* From /usr/sww/lib/libtiff.a(tif_open.o) */
  TIFFCurrentRow();  /* From /usr/sww/lib/libtiff.a(tif_open.o) */
  TIFFCurrentDirectory();  /* From /usr/sww/lib/libtiff.a(tif_open.o) */
  TIFFCurrentStrip();  /* From /usr/sww/lib/libtiff.a(tif_open.o) */
  TIFFCurrentTile();  /* From /usr/sww/lib/libtiff.a(tif_open.o) */
  TIFFIsByteSwapped();  /* From /usr/sww/lib/libtiff.a(tif_open.o) */
  TIFFInitPackBits();  /* From /usr/sww/lib/libtiff.a(tif_packbits.o) */
  TIFFPrintDirectory();  /* From /usr/sww/lib/libtiff.a(tif_print.o) */
  TIFFReadScanline();  /* From /usr/sww/lib/libtiff.a(tif_read.o) */
  TIFFReadEncodedStrip();  /* From /usr/sww/lib/libtiff.a(tif_read.o) */
  TIFFReadRawStrip();  /* From /usr/sww/lib/libtiff.a(tif_read.o) */
  TIFFReadBufferSetup();  /* From /usr/sww/lib/libtiff.a(tif_read.o) */
  TIFFReadTile();  /* From /usr/sww/lib/libtiff.a(tif_read.o) */
  TIFFReadEncodedTile();  /* From /usr/sww/lib/libtiff.a(tif_read.o) */
  TIFFReadRawTile();  /* From /usr/sww/lib/libtiff.a(tif_read.o) */
  TIFFNoPostDecode();  /* From /usr/sww/lib/libtiff.a(tif_read.o) */
  TIFFSwab16BitData();  /* From /usr/sww/lib/libtiff.a(tif_read.o) */
  TIFFSwab32BitData();  /* From /usr/sww/lib/libtiff.a(tif_read.o) */
  TIFFSwabShort();  /* From /usr/sww/lib/libtiff.a(tif_swab.o) */
  TIFFSwabLong();  /* From /usr/sww/lib/libtiff.a(tif_swab.o) */
  TIFFSwabArrayOfShort();  /* From /usr/sww/lib/libtiff.a(tif_swab.o) */
  TIFFSwabArrayOfLong();  /* From /usr/sww/lib/libtiff.a(tif_swab.o) */
  TIFFGetBitRevTable();  /* From /usr/sww/lib/libtiff.a(tif_swab.o) */
  TIFFReverseBits();  /* From /usr/sww/lib/libtiff.a(tif_swab.o) */
  TIFFComputeStrip();  /* From /usr/sww/lib/libtiff.a(tif_strip.o) */
  TIFFNumberOfStrips();  /* From /usr/sww/lib/libtiff.a(tif_strip.o) */
  TIFFVStripSize();  /* From /usr/sww/lib/libtiff.a(tif_strip.o) */
  TIFFStripSize();  /* From /usr/sww/lib/libtiff.a(tif_strip.o) */
  TIFFInitThunderScan();  /* From /usr/sww/lib/libtiff.a(tif_thunder.o) */
  TIFFComputeTile();  /* From /usr/sww/lib/libtiff.a(tif_tile.o) */
  TIFFCheckTile();  /* From /usr/sww/lib/libtiff.a(tif_tile.o) */
  TIFFNumberOfTiles();  /* From /usr/sww/lib/libtiff.a(tif_tile.o) */
  TIFFTileRowSize();  /* From /usr/sww/lib/libtiff.a(tif_tile.o) */
  TIFFVTileSize();  /* From /usr/sww/lib/libtiff.a(tif_tile.o) */
  TIFFTileSize();  /* From /usr/sww/lib/libtiff.a(tif_tile.o) */
  TIFFFdOpen();  /* From /usr/sww/lib/libtiff.a(tif_unix.o) */
  TIFFOpen();  /* From /usr/sww/lib/libtiff.a(tif_unix.o) */
  TIFFGetVersion();  /* From /usr/sww/lib/libtiff.a(tif_version.o) */
  TIFFSetWarningHandler();  /* From /usr/sww/lib/libtiff.a(tif_warning.o) */
  TIFFWarning();  /* From /usr/sww/lib/libtiff.a(tif_warning.o) */
  TIFFWriteScanline();  /* From /usr/sww/lib/libtiff.a(tif_write.o) */
  TIFFWriteEncodedStrip();  /* From /usr/sww/lib/libtiff.a(tif_write.o) */
  TIFFWriteRawStrip();  /* From /usr/sww/lib/libtiff.a(tif_write.o) */
  TIFFWriteTile();  /* From /usr/sww/lib/libtiff.a(tif_write.o) */
  TIFFWriteEncodedTile();  /* From /usr/sww/lib/libtiff.a(tif_write.o) */
  TIFFWriteRawTile();  /* From /usr/sww/lib/libtiff.a(tif_write.o) */
  TIFFFlushData1();  /* From /usr/sww/lib/libtiff.a(tif_write.o) */
  TIFFSetWriteOffset();  /* From /usr/sww/lib/libtiff.a(tif_write.o) */
}

The program above was created by the script which you might wish to run should libtiff change:

#!/usr/sww/bin/perl5 -w

$LIB = "/usr/sww/lib/libtiff.a";
if (!open(TOC, "nm $LIB |")) {
  print STDERR "Couldn't read namelist from $LIB!\n";
  exit 1;
}

# print "#include \"/usr/sww/include/tiffio.h\"\n\n";
print "int\ndummyRoutinesToForceLoadOfLibTiff()\n{\n";

$obj = 0;
while (<TOC>) {
  chop;
  if (/^\s*$/) {
    # ignore blank lines
  } elsif (/^Name\s+Value\s+Scope\s+Type\s+Subspace\s*$/) {
    # ignore header line
  } elsif (/^Symbols from ([^\[]+)\[([^\]]+)\]:\s*$/) {
    $obj++;
    $objname = "$1($2)";
  } elsif (/^(\S+)\s*\|\s*\S*\|(\S+)\s*\|(\S+)\s*\|(\S*)\s*$/) {
    ($name,$scope,$type,$subspace) = ($1,$2,$3,$4);
    if ($scope eq 'extern' && $type eq 'entry' && $subspace eq '$CODE$') {
      print "  $name();  /* From $objname */\n" if ($name !~ /^_/);
    }
  } else {
    print "\t/* $_ */\n";
  }
}
close(TOC);

print "}\n";
exit 0;

replace-pict (smallpict bigpict x y) takes two picture structures and overwrites the bigpict with the small one, starting with the smaller one's lower-left location on location x,y on the larger one, and extending according to its size.

A number of utilities for dealing with individual rows are available, documented in the source code.



next up previous
Next: About this document Up: A Suite of Lisp Previous: Appendix:sample texts



Class Account
Fri Dec 1 14:31:16 PST 1995