OCRchie - Optical Character Recognition Library Features
Image Processing
- Reads a file in standard two-level (bilevel) TIFF format
- Detects skew angle
- Deskews image
- Creates both RLE Map and BitMap representations of the
document.
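As an illustration of the RLE Map idea, here is a minimal Python sketch
that stores each row of a two-level image as intervals of black pixels.
The function names and the (start, end) interval layout are assumptions
made for this example; they are not taken from the OCRchie source.

```python
def rle_encode_row(row):
    """Encode one row of 0/1 pixels as inclusive (start, end) runs of black."""
    runs = []
    start = None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x                      # a black run begins
        elif not pixel and start is not None:
            runs.append((start, x - 1))    # the run ended on the previous pixel
            start = None
    if start is not None:                  # run reaches the right edge
        runs.append((start, len(row) - 1))
    return runs


def rle_encode(bitmap):
    """Encode a whole bitmap (list of rows) as one run list per row."""
    return [rle_encode_row(row) for row in bitmap]


def rle_decode_row(runs, width):
    """Rebuild a 0/1 pixel row of the given width from its runs."""
    row = [0] * width
    for start, end in runs:
        for x in range(start, end + 1):
            row[x] = 1
    return row
```

For mostly white pages the run lists are much shorter than the pixel
rows, which is why a run-length form is convenient for display and for
scanning along a line.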
Learning
Learns from a sample TIFF file and its ASCII translation.
- Makes OCRchie easy to integrate into any application
with consistent font input.
- No language assumptions are made.
- Can, for example, learn from the first page of a book and
translate the rest.
- Saves learned characters to a file.
- Reads learned characters from a file without relearning.
Interactive Learning
- Mark an individual component to add to the learned characters
- After recognition, add corrections to the learned characters,
averaging those that are close in distance.
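The averaging step could look roughly like the Python sketch below. It
assumes a Euclidean distance between property vectors, a hypothetical
threshold parameter, and a running mean per learned entry; none of
these details are stated in this feature list.

```python
import math


def distance(a, b):
    """Euclidean distance between two property vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def add_correction(learned, label, props, threshold=10.0):
    """Fold a corrected character into the learned set.

    learned is a list of dicts with keys "label", "props" and "count".
    If an entry with the same label lies within `threshold`, the new
    properties are averaged into it; otherwise a new entry is added.
    """
    for entry in learned:
        if entry["label"] == label and distance(entry["props"], props) <= threshold:
            n = entry["count"]
            # Running mean over all samples merged into this entry so far.
            entry["props"] = [(p * n + q) / (n + 1)
                              for p, q in zip(entry["props"], props)]
            entry["count"] = n + 1
            return entry
    entry = {"label": label, "props": list(props), "count": 1}
    learned.append(entry)
    return entry
```

Keeping a sample count per entry lets later corrections shift the
average less and less, so one unusual sample does not dominate.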
Component Extraction
Extracts characters as components by:
- Determining the separation of lines
- Performing connected component analysis
- Projecting up and down from the connected component within
the line boundaries
Note: characters such as i and j remain intact.
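The extraction steps above can be sketched in Python as follows. The
sketch works on one text line that has already been cut out as a list
of 0/1 pixel rows, labels 8-connected regions, and then merges regions
whose column ranges overlap, which is one way to read "projecting up
and down". The function names and the choice of 8-connectivity are
assumptions for illustration.

```python
from collections import deque


def connected_components(bitmap):
    """Return (left, top, right, bottom) boxes of 8-connected black regions."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x] or seen[y][x]:
                continue
            left = right = x
            top = bottom = y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:                       # breadth-first flood fill
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


def merge_stacked(boxes):
    """Merge boxes within one line whose column ranges overlap."""
    merged = []
    for box in sorted(boxes):                  # sorted by left edge
        if merged and box[0] <= merged[-1][2]:
            l, t, r, b = merged[-1]
            merged[-1] = (l, min(t, box[1]), max(r, box[2]), max(b, box[3]))
        else:
            merged.append(box)
    return merged
```

With this merge, the dot of an i sits in the same columns as its stem,
so both pieces end up in one component.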
The 28 properties of the character are set as:
- The gray scale of each of 25 regions.
- The height/width ratio of the character.
- Whether the character is vertically disjoint (e.g. i, j).
- The gray scale of the top and bottom halves.
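A rough Python sketch of computing these properties from a character
bitmap follows. It assumes the 25 regions form a 5 x 5 grid, that
"gray scale" means the share of black pixels scaled to 0..255, and
that "vertically disjoint" means a blank row between inked rows. The
sketch returns every value the list names and does not try to
reproduce the library's exact 28-slot layout, which is not spelled out
here.

```python
def region_gray(bitmap, top, bottom, left, right):
    """Share of black pixels in rows top..bottom-1, cols left..right-1 (0..255)."""
    area = (bottom - top) * (right - left)
    if area <= 0:
        return 0
    black = sum(bitmap[y][x]
                for y in range(top, bottom)
                for x in range(left, right))
    return round(255 * black / area)


def character_properties(bitmap, grid=5):
    """Property vector for a non-empty 0/1 character bitmap (list of rows)."""
    height, width = len(bitmap), len(bitmap[0])
    # Grid boundaries; regions may be empty for very small characters.
    ys = [height * i // grid for i in range(grid + 1)]
    xs = [width * i // grid for i in range(grid + 1)]
    props = [region_gray(bitmap, ys[r], ys[r + 1], xs[c], xs[c + 1])
             for r in range(grid) for c in range(grid)]
    props.append(height / width)               # height/width ratio
    # Disjoint if some row between the first and last inked rows is blank.
    inked_rows = [y for y, row in enumerate(bitmap) if any(row)]
    disjoint = (bool(inked_rows)
                and len(inked_rows) < inked_rows[-1] - inked_rows[0] + 1)
    props.append(1 if disjoint else 0)
    middle = height // 2
    props.append(region_gray(bitmap, 0, middle, 0, width))        # top half
    props.append(region_gray(bitmap, middle, height, 0, width))   # bottom half
    return props
```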
Determines grouping in relation to the baseline:
- Group 0 - regular small chars -> acsow etc.
- Group 1 - tail hangs down -> yjp
- Group 2 - tall chars -> OWD
- Group 3 - Both tall and hang down -> ()
- Group 4 - floating chars -> ' - *
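The grouping can be sketched in Python as below. The sketch assumes
that the baseline and the top of the small letters (called
x_height_line here) are already known for the text line, that y grows
downward, and that a small tolerance absorbs pixel noise. These
parameter names are hypothetical.

```python
def baseline_group(top, bottom, x_height_line, baseline, tolerance=2):
    """Return the group number 0..4 for a character box.

    top and bottom are the y coordinates of the box edges, with y
    growing downward as in image coordinates.
    """
    rises = top < x_height_line - tolerance    # taller than small letters
    hangs = bottom > baseline + tolerance      # tail below the baseline
    floats = bottom < baseline - tolerance     # does not reach the baseline
    if floats:
        return 4
    if rises and hangs:
        return 3
    if rises:
        return 2
    if hangs:
        return 1
    return 0
```

For example, with the baseline at y = 20 and small letters starting at
y = 10, a box spanning 10..20 falls in group 0 and a box spanning
2..26 falls in group 3.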
Character Recognition
- Properties are compared to the learned characters
- Matches are assigned a confidence value from 0 to 255.
- Compared first against the character's own group. If there is
no good match, compared against the other groups.
- If confidence is low and character is wide, split it
at thinnest point horizontally.
- If confidence is low and character is tall, split at
thinnest point vertically.
- If confidence is low and there is more than one connected
component, split by connected components.
- If confidence is low, attempt to join with adjacent components.
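The matching order and one of the fallback splits can be sketched in
Python as below. The mapping from distance to a 0..255 confidence, the
good_enough cut-off, and the dictionary layout of learned characters
are all assumptions for illustration; the feature list does not say
how OCRchie computes them.

```python
import math


def confidence(props, learned_props, max_distance=255.0):
    """Map Euclidean distance to a 0..255 confidence (255 = identical)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(props, learned_props)))
    return max(0, round(255 * (1 - d / max_distance)))


def best_match(props, candidates):
    """Return (label, confidence) of the closest candidate, or (None, 0)."""
    best = (None, 0)
    for entry in candidates:
        c = confidence(props, entry["props"])
        if c > best[1]:
            best = (entry["label"], c)
    return best


def recognize(props, group, learned, good_enough=200):
    """Try the character's own group first, then every other group."""
    own = [e for e in learned if e["group"] == group]
    label, conf = best_match(props, own)
    if conf >= good_enough:
        return label, conf
    others = [e for e in learned if e["group"] != group]
    other_label, other_conf = best_match(props, others)
    return (other_label, other_conf) if other_conf > conf else (label, conf)


def thinnest_column(bitmap):
    """Interior column with the least ink (assumes width >= 3)."""
    width = len(bitmap[0])
    counts = [sum(row[x] for row in bitmap) for x in range(width)]
    return min(range(1, width - 1), key=lambda x: counts[x])


def split_at_column(bitmap, x):
    """Cut a wide component into left and right parts at column x."""
    return [row[:x] for row in bitmap], [row[x:] for row in bitmap]
```

After a split, each half would be run through recognize again, and the
split kept only if the halves score better than the whole.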
Word Extraction
- Determines average spacing and delimits words.
- The full ASCII translation of each word, pointers to its components,
and an overall confidence are packaged into a word list for easy import
into a spell checker or other post-processing.
- Each character is considered a separate word within equations.
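Word delimiting by average spacing might look like the Python sketch
below. It assumes that a gap wider than the average gap on the line
ends a word, and that the overall confidence of a word is the lowest
confidence among its characters. Both rules, and the tuple layout of a
component, are assumptions made for this example.

```python
def split_words(components):
    """Group recognized characters on one line into words.

    components is a list of (char, left, right, confidence) tuples
    sorted from left to right.
    """
    if not components:
        return []
    # Horizontal gap between each character and the next one.
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(components, components[1:])]
    average_gap = sum(gaps) / len(gaps) if gaps else 0
    words = [[components[0]]]
    for gap, comp in zip(gaps, components[1:]):
        if gap > average_gap:
            words.append([comp])       # wide gap: start a new word
        else:
            words[-1].append(comp)
    return [
        {
            "text": "".join(c[0] for c in word),
            "components": word,        # kept so callers can reach the boxes
            "confidence": min(c[3] for c in word),
        }
        for word in words
    ]
```

Each entry carries its text, its components and a confidence, so a
spell checker can decide which words are worth a second look.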
Output Options
- ASCII file output
- Scanworx wordbox format
Zoning
- A page is zoned by performing a horizontal then vertical
merge of components to create boundaries for columns and other regions.
The parameters for merging can be set by the user and the regions can then
be corrected in the user interface.
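The horizontal-then-vertical merge can be sketched in Python as below.
Boxes are (left, top, right, bottom) tuples, and h_gap and v_gap stand
in for the user-settable merge parameters. The exact merge rule used
by OCRchie is not given here, so the closeness tests are assumptions.

```python
def interval_gap(a_lo, a_hi, b_lo, b_hi):
    """Distance between two 1-D intervals; <= 0 when they touch or overlap."""
    return max(a_lo, b_lo) - min(a_hi, b_hi)


def union(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_pass(boxes, close):
    """Repeatedly merge any two boxes for which close(a, b) is true."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if close(boxes[i], boxes[j]):
                    boxes[i] = union(boxes[i], boxes[j])
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes


def zone(boxes, h_gap, v_gap):
    """Merge component boxes into lines, then lines into zones."""
    def near_horizontally(a, b):
        return (interval_gap(a[0], a[2], b[0], b[2]) <= h_gap
                and interval_gap(a[1], a[3], b[1], b[3]) <= 0)

    def near_vertically(a, b):
        return (interval_gap(a[1], a[3], b[1], b[3]) <= v_gap
                and interval_gap(a[0], a[2], b[0], b[2]) <= 0)

    lines = merge_pass(boxes, near_horizontally)
    return sorted(merge_pass(lines, near_vertically))
```

Raising h_gap makes neighbouring columns fuse into one zone, and
lowering v_gap splits paragraphs apart, which is why exposing both
parameters to the user is useful.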
User Interface
- Demonstrates the features of the OCRchie library
- Allows interactive correction of the translation
- Draws from the RLE Map to speed up display
- Sets confidence parameters and display options as well
as global variables
- Split components before learning and/or recognition.
The user may split components by 1) connected components, 2) vertical
split at the thinnest point, or 3) horizontal split at the thinnest point.
- Join components before learning and/or recognition.
- Mark a specific component to learn independently of the
rest of the document.
- Mark equation boundaries. (Each character within an equation
will be saved as a separate word.)
- Zone a document and make corrections. Select an active
zone to perform recognition and the operations listed above.