CS294-9: Document Image Analysis
R. Fateman and H. Baird
Fall, 2000
University of California at Berkeley

Assignment 3: Due October 11, 2000

Here we ask you to experiment with symbol recognition and classification.

Part I:

You are given a collection of characters in some file (for example, numbers.txt, which you have already demonstrated you can read into some data structure in your Assignment 1). Choose some n easily computable properties for each character, and use them to encode each character as a vector of those n values. As we pointed out in class, the number of properties n in a real OCR system might be a few hundred, and each property would probably be an 8-bit byte or even a single bit. In class we discussed the c vs. e discrimination problem by identifying each glyph with an 81-bit vector: a 9 by 9 array of 0/1 bits. Conceptually, this places each vector at some corner of an 81-dimensional cube.

Explain what properties you've computed and why you think they might be helpful. If you have no other ideas, you can compute the 27 properties suggested in Part II. (A small Lisp sketch of computing properties from a 9 by 9 bit array appears after Part III.)

Part II:

Here we've done Part I for you and have provided you with some data. Actually we've supplied a lot more data than you'd get from just reading numbers.txt, but hardly as much as we've seen is necessary. (To distinguish c from e, it seems that performance continues to improve with 10^5 to 10^6 samples.) Here we are giving you about 10^2 samples for each of 10 digits.

Write a program that can read the files p0 through p9. Feel free to edit your own copy of these files if it helps you. You can, by inserting left and right parentheses, read a whole file with one read command in Lisp. These files are in the images/pvs subdirectory of the class home page. (A sketch of reading the files this way, and of one simple clustering pass, also appears after Part III.)

Each of the property vectors of the symbols can be thought of as a point on a grid in 27-dimensional space. Try to find a clustering algorithm that can re-separate these 1127 characters into the 10 sets represented by the 10 files. (There are a few broken characters tossed in to make life hard.) This is the unsupervised learning problem. A nearest-neighbor approach would simply say: vectors v1 and v2 are so close that they should be put into the same cluster; vector v3 is too far away from either of them, so we will put it in a different cluster.

Don't be surprised if you can't quite do this. You should, however, be able to get all vectors separated into a few clusters in which all the characters are the same. (For example, I found it easy to separate out 5's into several different clusters, each containing a variation of the 5.) Explain your clustering algorithm and how effective it was. It may be helpful to identify the characters that you read in from each of the files by index, e.g. p8[3] is the 4th vector among the 8's.

Part III:

It seems quite unlikely that humans store each of the thousands of examples of hundreds of characters in our data set in order to figure out the identity of a new symbol. As we've seen, even for computers this is not such a great approach. Consider a simpler characterization of the data in the files. For example, take the average of all the vectors in each given file, and then use just those 10 vectors to discriminate among all the characters. How good is this measure? In particular, test to see if every character in a file is closer to the average of its own file than it is to the average of some other file. Explain why, in general, this might not work, even in low dimensions. (A sketch of this nearest-average test follows.)
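The sketches below, in Common Lisp, are optional illustrations of Parts I through III. They make assumptions about the data layout, noted in the comments, that you should check against the actual files. For Part I, assuming a glyph is a 9 by 9 array of 0/1 bits as in the c vs. e example, one possibility is to use the nine row counts, the nine column counts, and the total number of black pixels; these are only illustrative properties, not the 27 supplied with Part II.

    (defun glyph-properties (glyph)
      ;; GLYPH is assumed to be a 9x9 array of 0/1 bits, e.g. one made
      ;; with (make-array '(9 9) :element-type 'bit).  Returns a list of
      ;; 19 numbers: nine row counts, nine column counts, and the total
      ;; number of black pixels.
      (let ((rows (loop for r below 9
                        collect (loop for c below 9 sum (aref glyph r c))))
            (cols (loop for c below 9
                        collect (loop for r below 9 sum (aref glyph r c)))))
        (append rows cols (list (reduce #'+ rows)))))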
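For Part II, once a copy of a file has been wrapped in a left and a right parenthesis, a single read returns its contents as one list. One simple clustering idea, in the nearest-neighbor spirit described above, is a greedy "leader" pass: a vector joins the first cluster whose first member is within some threshold distance, and otherwise starts a new cluster. The assumed layout (each property vector read in as a list of 27 numbers) and any particular threshold are assumptions you will have to settle by inspecting the files and experimenting.

    (defun read-vectors (filename)
      ;; Assumes the file has been edited to begin with "(" and end with
      ;; ")", so the whole file is one Lisp list of property vectors.
      (with-open-file (in filename :direction :input)
        (read in)))

    (defun sq-distance (v1 v2)
      ;; Squared Euclidean distance between two property vectors (lists).
      (loop for a in v1 for b in v2 sum (* (- a b) (- a b))))

    (defun cluster (vectors threshold)
      ;; Greedy single-pass clustering: each vector joins the first
      ;; cluster whose leader (its first member) is within THRESHOLD of
      ;; it; otherwise it starts a new cluster.  Returns a list of
      ;; clusters, each a list of vectors.
      (let ((clusters '()))
        (dolist (v vectors)
          (let ((home (find-if (lambda (c)
                                 (< (sq-distance v (first c)) threshold))
                               clusters)))
            (if home
                (push v (cdr home))          ; join an existing cluster
                (push (list v) clusters))))  ; start a new cluster
        (nreverse clusters)))

Under these assumptions, appending the ten lists returned by read-vectors and calling cluster on the result, with a threshold found by trial, attempts the re-separation asked for above.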
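For Part III, a minimal sketch of the nearest-average test, reusing sq-distance from the Part II sketch; the vectors are again assumed to be lists of 27 numbers.

    (defun mean-vector (vectors)
      ;; Componentwise average of a list of equal-length property vectors.
      (let ((n (length vectors))
            (sums (reduce (lambda (acc v) (mapcar #'+ acc v))
                          (rest vectors)
                          :initial-value (first vectors))))
        (mapcar (lambda (s) (/ s n)) sums)))

    (defun nearest-mean (v means)
      ;; Index (0-9) of the mean in MEANS closest to V.
      (let ((best 0)
            (best-d (sq-distance v (first means))))
        (loop for m in (rest means)
              for i from 1
              for d = (sq-distance v m)
              when (< d best-d)
                do (setf best i best-d d))
        best))

    (defun count-misplaced (files)
      ;; FILES is a list of ten lists of vectors, in order p0 ... p9.
      ;; Counts the vectors that are closer to the average of some other
      ;; file than to the average of their own file.
      (let ((means (mapcar #'mean-vector files))
            (errors 0))
        (loop for digit from 0
              for vectors in files
              do (dolist (v vectors)
                   (unless (= (nearest-mean v means) digit)
                     (incf errors))))
        errors))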
Part IV:

What can you say about the likelihood of one digit being confused for another, given this training set? If you wished to improve the recognition and classification program of Part III, how would you approach it? (This question is quite open-ended. Sketch a few ideas.)
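One way to make the confusion question concrete, sketched below on top of mean-vector and nearest-mean from the Part III sketch, is to tally a 10 by 10 confusion matrix whose entry (i, j) counts the vectors from file p_i whose nearest file average is that of p_j. Large off-diagonal entries point at digit pairs likely to be confused, though keep in mind that the averages were computed from the same data being classified.

    (defun confusion-matrix (files)
      ;; FILES is a list of ten lists of property vectors, in order
      ;; p0 ... p9.  Uses mean-vector and nearest-mean from the Part III
      ;; sketch.  Entry (i j) counts vectors from file i classified as j.
      (let ((means (mapcar #'mean-vector files))
            (matrix (make-array '(10 10) :initial-element 0)))
        (loop for i from 0
              for vectors in files
              do (dolist (v vectors)
                   (incf (aref matrix i (nearest-mean v means)))))
        matrix))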