CS169 - Software Engineering
Project Specification and Requirements
While there are currently many OCR Software packages available for retail sale, to the best of our knowledge, there are no publicly available free OCR libraries. Our goal is to produce a basic OCR library in C++ which will be freely available on the Internet to all who want to use it. Our goals are to
The general function of the software will be to translate scanned document TIFF files into ASCII text. The program will first provide the capability to deskew the image (if scanned at an angle). Then all or part of the document can be translated into ASCII text. The system will have the capability to "learn" and store new characters which will be used in future comparisons. In addition we will be creating a simple TCL/TK interface for learning new characters, and demonstrating the use of our OCR package.
This software package is to be based in large part on the Lisp OCR system as written by professor Fateman. We are using his general algorithms for this OCR project. We would like to thank him for the copious support, input, and source code he has given us already.
General I/O and Data Flow -
Input Files:
| GUI - Deskew, mark text, learn new characters | Character Recognition, User Correction
\ / Output Files:
The Page
Our primary data object is a Page. A Page can represent either an entire scanned page as read in from the TIFF file or some subsection of that page that has been extracted for character recognition. Page inherits from RLEMap, a class which has the run length encoded representation of the TIFF file and various access, manipulation and I/O routines (as specified in assignment 1 and Appendix A).
A Page has two significant data members in addition to those of the RLEMap.
The first is an array of lists of connected components. Each element of the array points to a theoretical row of text. Each connected component normally represents a section of the RLEMap which corresponds to one character. Connected components, however, do not hold any graphic representation, but rather contain information about the section of the RLEMap from which they were extracted. This information includes the upper left point and lower right point, as well as a property vector for the region. (Property vectors are outlined in Appendix B). Finally, a connected component has a field for its ASCII identification and a confidence rating low, medium, or high.
|fRowComponents[0]->"A"->"E"->"B" ....
|fRowComponents[1]->"E"->"A" .....
|fRowComponents[2]->"A"->"E"->"B" ...
| .
|fRowComponents[fnumRows -1]->
The other major data member of a Page is a list of CharGroups. A CharGroup represents a group of connected components whos property vectors are similar enough to classify them as the same character. A CharGroup class contains a property vector which is the average of all of the property vectors of the components belonging to that group and a list of pointers to candidate components. Initial groups are created from the previously learned characters. A heuristic "distance" is calculated from each candidate component to the existing groups. If the distance is within a predefined threshold, the component is added to the group, otherwise a new group is created with the ASCII identification initialized as NULL.
Each Component is assigned to exactly one CharGroup. The idea is that the user interface would allow unidentified groups to be given an ASCII identification. In the above example there would be three character groups, one with a list of pointers to all of the A's one to the B's and one to the E's. (Note: We could allow group reassignment but this would require doubly linked lists and might be ugly.)
Both the Row Components and CharGroups are null upon creation of the page but are built up by member functions. Detailed data representation information is included in Appendix B.) The User Interface
(Current big holes in representation for white space between words and disjoint characters (i,j,;,:)
The user interface will be constructed in TCL/TK. We feel that TCL/TK will provide an easy GUI to illustrate our OCR library.
Our simple OCR system will make certain assumptions about the type of data that will be encountered. In future releases, we hope that some of these restrictions can be relaxed.
Beyond these first two restrictions our plan is to make our code very portable. To use the OCR suite on a different machine with a different TIFF library would entail only modifying the readMap and writeMap functions.
3) The software will correct skewed images of up to ten degrees. Skewed images of a greater angle cannot be corrected by the system.
4) Only single column text based, black on white images are supported.
5) Only the English character set is expected and supported by the software.
6) Text cannot be underlined (really bad for connected components) or highlighted. Nice neat standard black text on white paper is best.
Functional Requirements Specifications and Solution
The functional requirements and specifications of our system can be divided into four categories, 1) Skew detection/correction, 2)Page layout analysis, 3) Character recognition and 4) The user interface. These are outlined below.
The lines in a bitmap may not be aligned with text lines of the document due to human factors and the limitations of mechanical devices. The bitmap is usually skewed at a slight angle. A deskewing process is needed to detect a skew angle and produce a new unskewed bitmap. Below is an analysis of the different methods we may use.
There are three basic methods available: projection methods, Hough transform methods and nearest-neighbor methods. The projection methods are most efficient among those three methods but are proved to effective only up to a skew angle of 10 degrees and less accurate. Hough transform methods are much more accurate than projections methods but considerable slower than the projection methods. The Hugh transform methods are effective to a skew angle of 15 degrees. The nearest-neighbor methods are more accurate than both methods and have no limitation on a skew angles, but the cost is very higher and the methods are much more sensitive to the layout of a text page. Although there are many improvements on the those methods, due to the limitation of time and resources, we can not fully investigate them.
Since we assume a bitmap from a scanned text page is skewed slightly up to an angle of 10 degrees and the limitation of time and resources, we pick the projection methods as our prototype method to deskew a bitmap. A further improvement will be made based on the projection methods.
Roughly, at different skew angles, the black dots along each row of bitmaps are counted. An angle dominates the other angle if at this angle, the number of rows with most dots is far great than other angles statistically. The skew angle is the angle which dominates other angle.
The process will be implemented at the increment of 1 degree.
The difficulty in the process is to recalculate the distribution of black dots after a slight rotation of a bitmap. This is the main obstacle to limit the capability of this method.
One feature is to consider an interactive deskewing process.
There are two utility functions:
Very little page layout analysis will be done. However, the end of text rows and paragraphs will be detected and appropriate carriage returns entered in the ASCII file.
3) Character Recognition (??? Need stronger specification )
The gray scale is a number between 0 (all white) and 255 (all black).
if height/width ratio difference is unacceptable
distance = +oo else if height difference is unacceptable
distance = +oo else
for region = 0 to 24
distance += square(grayScale(CharGroup,region) - grayScale(Component, region)) endif
III. NonFunctional Requirements
Product requirements:
Speed and Memory Efficiency
Our focus in writing this software is functionality, rather than speed or memory efficiency. Of course, the reality is that time and space are valuable resources, but they are, in fact, valuable resources that OCARchie (at least in its initial implementation) will consume with a voracious appetite. Even so, OCArchie will hold only one scanned image in memory. It is assumed that user must have enough memory to hold a page representation in three formats:
Process Requirements
Implementation and Portability
Software is to be written in C++ and TCL/TK.
The software must be portable to other Unix systems when recompiled, provided the Silicon Graphics TIFF library is present.
Delivery
Software must be completed, tested and documented by December 1995.
IV. Glossary
OCR-Optical Character Recognition
RLE-Run Length Encoding. Pixels from each row of a TIFF image are
represented as a list of black pairs. Each pair contains the starting and ending column of a group of black pixels.
TIFF-Grafic storage of scanned image. Accessed by Silicon Graphics
library. (???? Don't know what TIFF stands for)
APPENDIX A
Functional requirements specification
RLEMap functional specification
Since Page inherits from RLEMap, the functions listed below are of course available to Page as well. Listed below each requirement is the member function which will perform the function, a description of arguments, constraints, effects and return values.
There are several supporting classes which are referred to in this section. These are outlined briefly here and in more detail in Appendix B.
Point - Consists of an X and Y coordinate.
Component - consists of an upper-left point, lower right-point, the number of black bits in the component, the property vector, an ASCII identification and a confidence rating.
Components - a linked list of objects of type Component. CompPtrs - a linked list of pointers to Components.
CharGroup - Inherits from Component and CompPtrs. As described in the previous section, the property vector of the character group is the average of each of the components pointed two.
RLEPair - A starting and ending column, used for Run Length Encoding.
RLEMap Functions
Unary Commands
Member function : RLEMap() (constructor) Arguments: none Effects: The constructor creates the new object. The RLEMap is empty and memory is not allocated for pixel information until readMap() is called on the object. fImageWidth, fImageLength, and fStat are initialized to 0,0 and EMPTY respectively Return Value: Pointer to the new RLEMap object
Member function : ~RLEMap() Arguments: none Effects: The destructor releases all memory holding pixel
information. Invoked on delete operation. Return Value: none
Member Function: RLEMap * copy(Point ul=NOPNT, Point lr=NOPNT) Arguments: ul is the upper left point at which the copy
should start. lr is the lower left point at which the copy should start. Effects: If the default values of ul=NOPNT and lr=NOPNT are
used, the entire map is copied. The copy function allocates a new RLEMap as well as identical bit information memory. Then it copies size, status and bit information from the server to the new object. If a upper left point and lower right point are specified, a submap is copied and size information adjusted accordingly. Return Value: Returns a pointer to the new object.
Selector Functions
7. Determine Length of a RLEMap Member Function: int imageLength() Arguments: None Effect - This Selector returns the number of vertical bits in map either as read from the TIFF file or determined by the submap operation (see copy operation). 8. Determine Width of a RLEMap Member Function: int imageWidth() Arguments: None Return Value - This Selector returns the number of horizontal bits in either as read from the TIFF file or determined by the submap operation (see binary operations).
Selector/Modifier functions
9. Determine/Set the status of a RLEMap Member Function: MapStatus & status() Arguments: None Effect: As a modifier allows client to set the status of a RLEMap. Return Value: As a selector returns the current status of the RLEMap. Comment: A client may need to reset the status to VALID after an I/O error.
3. Read a RLEMap from a TIFF file. Member function : MapStatus readMap(Char * filename) Arguments: filename is name of the TIFF file to open. Constraints: filename must be the name of a TIFF formatted file to ensure a proper read . Effects: Opens filename and reads TIFF file into RLEMap, recording size and pixel information for access by other member functions. On exit from this function the TIFF file is closed.If the RLEMap is not empty, memory containing old pixel information is released before the new read. Return Value: Returns a value of type MapStatus. MapStatus is an enumerated type. Potential return values are VALID, OPENERROR, READERROR 4. Write a RLEMap to TIFF file. (This is Probably not needed, but left here for now) Member Function: MapStatus writeMap(char * filename) Arguments: The name of the TIFF file to write to. Effects: Opens filename and writes RLEMap to filename in TIFF format. Return Value: Returns a value of type MapStatus. Potential return values are VALID, OPENERROR, WRITEERROR
Data Access and low level manipulation functions
5. Clear a RLEMap. Member function: MapStatus clear(Color clr) Arguments: clr is of enumerated type Color which can take on the values BLACK or WHITE. Effect: RLEMap is cleared to all one color. If an error occurs the status of the map is set to OTHERERROR and returned Return Value: Status of RLEMap 6. Set a single bit in a RLEMap to a specified color. Member function: MapStatus setBit(Point point, Color clr) Arguments: X and Y values of point are the row and column to change clr is BLACK or WHITE Effect: Changes the color of BMPA bit in position (X,Y) to the color clr. If an error occurs changes status to OTHERERROR Return Value: Status of RLEMap 7. Read and return a single bit from a RLEMap. Member function: Color readBit(Point point) Arguments:point contains the column and row of the bit to read. Constraints: point.x() < imageWidth() and point.y() < imageLength Return Value: Returns color of bit.
High Level Access and manipulation
8. Determine angle which the document must be rotated so that rows are orhthoganal to document edge. Member function: Angle skewAngle(); Arguments: none Constraints: Image must be rotated at no more than a 10 degree angle in either direction to ensure accurate result. Return Value: Angle at which the document must be rotated in a counterclockwise direction before OCR can be performed. Angle is just an alias for float. Comment: No rotation is actually performed in this function 9. Rotate the map a specified angle. Member function: MapStatus rotateMap(Angle angl) Arguments: Angle to rotate map Constraints: -10.0 < angl < 10.0 Effects: RLE representation is actually changed to represent rotated map. Return Value: Status of map after operation
10. Display on Standard output // Just for testing Member function : MapStatus display(Point point, int termWidth, int termLength, int scaleDownX, int scaleDownY) Arguments: point's X and Y values indicate the row and column at which to start the print. termWidth and termlength are the number of columns and rows on stdout.scaleDown is a scale down factor. If scaleDownX = i, every ith pixel in row will print. If scaleDownY = j, every jth pixel in a column will print. Constraints: point.X >=0, point.Y>=0 0 < scaleDownX <imageWidth 0 < termWidth <= # columns on screen 0 < termLength <= # rows on screen 0 < scaleDownX < imageLength. 0 < scaleDownY < imageLength. Effects: An X will print on stout for each non-white pixel. If scaleDown factors are 1, the rectangle bounded by (X,Y) and (X+termWidth,Y+termlength) will print. For scaledown > 1 the rectangle bounded by (X,Y) (X + (termwidth * scaleDownX), Y + (termlength * scaleDownY)) will print. 10. Extract a Bitmap representation for TCL/TK Member function: BitMap * ExtractComp(Point ul, Point lr, int scaleDownx, int scaleDowny) Arguments: ul and lr bound the region to be extracted. scaleDownx and scaledowny are as described for display. Constraints: ul and lr must be within map range. Scaledown factors constraints as described for display. Effect: Allocate a BitMap for TCL/TK use. Copy designated contents. Return Value: Pointer to new BitMap
The binary operations allow for comparison of two bit maps. Operator overloading is used for these functions to simplify use. The operators for RLEMaps overload the operators for bitwise logical operations. We may not need these but leave them in for now, unchanged from assignment 1.
Member function: MapStatus operator |= (LowMemMap & BMPB)
MapStatus operator |= (SpeedMap & BMPB) Arguments: BMPB is a reference to the Map with which the server
object (BMPA) will be compared. Constraints: BMPA and BMPB are of equal size and are VALID. Effect: Each white bit BMPA(i,j) will be set to BMPB(i,j).
The result will be a logical or. If an error occurs in the operation the status of BMPA will be set to OTHERERROR Return Value: The status value for BMPA. Comment: BMPA will be changed with the use of this operator.
Sample use: BMPA |= BMPB
Member function: MapStatus operator ^= (LowMemMap & BMPB) Member function: MapStatus operator ^= (SpeedMap & BMPB) Comparable to IOR described above but performs XOR operation.
Member function: MapStatus operator &= (LowMemMap & BMPB) Member function: MapStatus operator &= (SpeedMap & BMPB) Comparable to IOR described above but performs AND operation.
Member function: Boolean operator == (LowMemMap & BMPB)
Boolean operator == (SpeedMap & BMPB) Arguments: BMPB is a reference to the Map with which the server
object (BMPA) will be compared. Constraints: BMPA and BMPB are VALID. Effect: The function compares sizes and then compares bit
information of BMPA and BMPB. Return Value: False if BMPA does not have the same size and/or
content as BMPB. Comment: Sample use if (BMPA == BMPB) ...
Page Functions - additional functionality of a page.
Member function: MapStatus extractRows(); Arguments: none Constraints: Map must already be deskewed Effect: Sets fnumrows. Allocates array for fRowcomponents
and creates a list of components for each row. Each component is given a property vector as described in functional requirements section. Return Value: Status of Page after operation
Member function: MapStatus readLearnedChars(char *filename); Arguments: name of file where characters are stored Constraints: filename must have been previously created
by writeLearnedChars. Return Value: Status of Page after operation
Member function writeLearnedChars(char *filename) Arguments: name of file where characters are to be stored Effect: Writes learned character information to file Return Value: Status of Page after operation
Member function: CompPtrs recognize(int row); Arguments: row is the row number to analyze Constraints: 0 < row < fnumRows Effect: For each component C in row, finds an existing character group G such that distance(C,G) < threshold. the asciiId for component c is set to match that of the character group. The character group is modified to point to the component and the character group property vector is modified to average in the new component. The addresses of any Groups that could not be recognized are appended to a list for return Return Value: Return list of new (unrecognized) Character
groups.
Component Functions
To be filled in later
APPENDIX B
Non Functional requirements specification
??? Need to do some math on memory requirements
Refer to .h files
page.h
pagesup.h
rlemap.h
bitmap.h
Hardware
Database requirements
Index