CS294-9: Document Image Analysis
R. Fateman and H. Baird
Fall, 2000
University of California at Berkeley

Assignment 3: Due October 11, 2000

Here we ask you to experiment with symbol recognition and classification.

Part I:

You are given a collection of characters in some file (for example, numbers.txt, which you have already demonstrated you can read into some data structure in your Assignment 1). Choose some n easily computable properties for each character, and use them to encode each character as a vector of those n values. As we pointed out in class, the number of properties n in a real OCR system might be a few hundred, and each property would probably be an 8-bit byte or even a single bit. In class we discussed the c vs. e discrimination problem by identifying each glyph with an 81-bit vector: a 9 by 9 array of 0/1 bits. Conceptually, this places each vector at some corner of an 81-dimensional cube.

Explain what properties you've computed and why you think they might be helpful. If you have no other ideas, you can compute the 27 properties suggested in Part II. (A small Lisp sketch of computing properties from a 9 by 9 bit array appears after Part III.)

Part II:

Here we've done Part I for you and have provided you with some data. Actually we've supplied a lot more data than you'd get from just reading numbers.txt, but hardly as much as we've seen is necessary. (To distinguish c from e, it seems that performance continues to improve with 10^5 to 10^6 samples.) Here we are giving you about 10^2 samples for each of 10 digits.

Write a program that can read the files p0 through p9. Feel free to edit your own copy of these files if it helps you. You can, by inserting left and right parentheses, read a whole file with one read command in Lisp. These files are in the images/pvs subdirectory of the class home page. (A sketch of reading the files this way, and of one simple clustering pass, also appears after Part III.)

Each of the property vectors of the symbols can be thought of as a point on a grid in 27-dimensional space. Try to find a clustering algorithm that can re-separate these 1127 characters into the 10 sets represented by the 10 files. (There are a few broken characters tossed in to make life hard.) This is the unsupervised learning problem. A nearest-neighbor approach would simply say: vectors v1 and v2 are so close that they should be put into the same cluster; vector v3 is too far away from either of them, so we will put it in a different cluster.

Don't be surprised if you can't quite do this. You should, however, be able to get all vectors separated into a few clusters in which all the characters are the same. (For example, I found it easy to separate out 5's into several different clusters, each containing a variation of the 5.) Explain your clustering algorithm and how effective it was. It may be helpful to identify the characters that you read in from each of the files by index, e.g. p8[3] is the 4th vector among the 8's.

Part III:

It seems quite unlikely that humans store each of the thousands of examples of hundreds of characters in our data set in order to figure out the identity of a new symbol. As we've seen, even for computers this is not such a great approach. Consider a simpler characterization of the data in the files. For example, take the average of all the vectors in each given file, and then use just those 10 vectors to discriminate among all the characters. How good is this measure? In particular, test to see if every character in a file is closer to the average of its own file than it is to the average of some other file. Explain why, in general, this might not work, even in low dimensions. (A sketch of this nearest-average test follows.)
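The sketches below, in Common Lisp, are optional illustrations of Parts I through III. They make assumptions about the data layout, noted in the comments, that you should check against the actual files. For Part I, assuming a glyph is a 9 by 9 array of 0/1 bits as in the c vs. e example, one possibility is to use the nine row counts, the nine column counts, and the total number of black pixels; these are only illustrative properties, not the 27 supplied with Part II.

    (defun glyph-properties (glyph)
      ;; GLYPH is assumed to be a 9x9 array of 0/1 bits, e.g. one made
      ;; with (make-array '(9 9) :element-type 'bit).  Returns a list of
      ;; 19 numbers: nine row counts, nine column counts, and the total
      ;; number of black pixels.
      (let ((rows (loop for r below 9
                        collect (loop for c below 9 sum (aref glyph r c))))
            (cols (loop for c below 9
                        collect (loop for r below 9 sum (aref glyph r c)))))
        (append rows cols (list (reduce #'+ rows)))))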
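For Part II, once a copy of a file has been wrapped in a left and a right parenthesis, a single read returns its contents as one list. One simple clustering idea, in the nearest-neighbor spirit described above, is a greedy "leader" pass: a vector joins the first cluster whose first member is within some threshold distance, and otherwise starts a new cluster. The assumed layout (each property vector read in as a list of 27 numbers) and any particular threshold are assumptions you will have to settle by inspecting the files and experimenting.

    (defun read-vectors (filename)
      ;; Assumes the file has been edited to begin with "(" and end with
      ;; ")", so the whole file is one Lisp list of property vectors.
      (with-open-file (in filename :direction :input)
        (read in)))

    (defun sq-distance (v1 v2)
      ;; Squared Euclidean distance between two property vectors (lists).
      (loop for a in v1 for b in v2 sum (* (- a b) (- a b))))

    (defun cluster (vectors threshold)
      ;; Greedy single-pass clustering: each vector joins the first
      ;; cluster whose leader (its first member) is within THRESHOLD of
      ;; it; otherwise it starts a new cluster.  Returns a list of
      ;; clusters, each a list of vectors.
      (let ((clusters '()))
        (dolist (v vectors)
          (let ((home (find-if (lambda (c)
                                 (< (sq-distance v (first c)) threshold))
                               clusters)))
            (if home
                (push v (cdr home))          ; join an existing cluster
                (push (list v) clusters))))  ; start a new cluster
        (nreverse clusters)))

Under these assumptions, appending the ten lists returned by read-vectors and calling cluster on the result, with a threshold found by trial, attempts the re-separation asked for above.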
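For Part III, a minimal sketch of the nearest-average test, reusing sq-distance from the Part II sketch; the vectors are again assumed to be lists of 27 numbers.

    (defun mean-vector (vectors)
      ;; Componentwise average of a list of equal-length property vectors.
      (let ((n (length vectors))
            (sums (reduce (lambda (acc v) (mapcar #'+ acc v))
                          (rest vectors)
                          :initial-value (first vectors))))
        (mapcar (lambda (s) (/ s n)) sums)))

    (defun nearest-mean (v means)
      ;; Index (0-9) of the mean in MEANS closest to V.
      (let ((best 0)
            (best-d (sq-distance v (first means))))
        (loop for m in (rest means)
              for i from 1
              for d = (sq-distance v m)
              when (< d best-d)
                do (setf best i best-d d))
        best))

    (defun count-misplaced (files)
      ;; FILES is a list of ten lists of vectors, in order p0 ... p9.
      ;; Counts the vectors that are closer to the average of some other
      ;; file than to the average of their own file.
      (let ((means (mapcar #'mean-vector files))
            (errors 0))
        (loop for digit from 0
              for vectors in files
              do (dolist (v vectors)
                   (unless (= (nearest-mean v means) digit)
                     (incf errors))))
        errors))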
Part IV:

What can you say about the likelihood of one digit being confused for another, given this training set? If you wished to improve the recognition and classification program of Part III, how would you approach it? (This question is quite open-ended. Sketch a few ideas.)
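One way to make the confusion question concrete, sketched below on top of mean-vector and nearest-mean from the Part III sketch, is to tally a 10 by 10 confusion matrix whose entry (i, j) counts the vectors from file p_i whose nearest file average is that of p_j. Large off-diagonal entries point at digit pairs likely to be confused, though keep in mind that the averages were computed from the same data being classified.

    (defun confusion-matrix (files)
      ;; FILES is a list of ten lists of property vectors, in order
      ;; p0 ... p9.  Uses mean-vector and nearest-mean from the Part III
      ;; sketch.  Entry (i j) counts vectors from file i classified as j.
      (let ((means (mapcar #'mean-vector files))
            (matrix (make-array '(10 10) :initial-element 0)))
        (loop for i from 0
              for vectors in files
              do (dolist (v vectors)
                   (incf (aref matrix i (nearest-mean v means)))))
        matrix))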