Optical Character Recognition

Vivek Kwatra

The problem was to find vowels in an image of text. This page contains a description of the technique and images showing results.
The recognition technique was implemented in MATLAB (mathworks, mathtools).

The basic steps involved in the recognition are as follows -

1. Identifying characteristics for recognition - The first step was to identify those characteristics of vowels which distinguish them from the consonants. These were identified as shape mask (image in upright position at a particular size), euler number, area, orientation, and rotation invariant moments (Hu Moments).

2. Segmentation of letters - The letters are first segmented by thresholding. The thresholding is done so as to try to preserve the euler number and separate the letters from each other. This is accomplished by identifying the holes in the image, dilating them and adding back to the original image and again thresholding this image. This image with the holes filled is then ANDed logically with the previously thresholded image to get the final thresholded image.

3. Training - The program was trained from the images for Hu moments and areas. For the training part, the program would compute the Hu moments and area of each letter, ask the user to identify the class (a,e,i,o,u or no vowel) of the letter, and then store the data in a file.

4. Recognition - During recognition, the program segments the input image into letters and then tests each letter for being a vowel. It first checks the euler number of the letter. An euler number of  0 indicated a high probability of a, e, o (if there is a vowel at all). An euler number of 1 indicates that the only possible vowels are i and u, but a, e, o can also disguise aa letters with euler number equal to 1 (if connectivity is not preserved). a has a high probability of occurring with 2 holes too.

After checking the euler number the areas of the letters are checked. If they lie within a desirable range (as noted during training), they are further processed or else they are identified as not being vowels.

The letters are now checked for shape similarity using shape masks. A shape mask each of a, e, i, o, u is computed beforehand and stored. The similarity is now measured as the number of pixels common in both the letter and the vowel minus the pixels which are not common in both.

If the shapes match closely, a vowel is declared, or else moments are matched using Mahalanobis distance between the moments of the current letter and the stored moments of each vowel. If these are within a threshold, a vowel is declared.

Some special processing is done to distinguish b and n. For an upright position u has a space above it and n has space below it. This is used to see how closely a letter which gave close results to u, matches n. If closeness to n is high, it is not considered a vowel.

To identify i, if a letter shows close match to the long stick in i, it is dilated along its orientation and objects overlapping the dilated stick are identified. If one of them matches a dot,  both of them are identified as being part of the i.

Assumptions & Weaknesses -

The techniques employed here assume certain properties about the characters which may not be preserved owing to the size of the character, the noise in the image and thresholding of the image.

One of such assumptions is about euler number. It is assumed that the euler number for a,e,o should almost always be 0,1 or -1 which may not always be true.
The method assumes  that segmentation separates each and every letter in the image, which is not always the case.
A certain size and noisiness in the image is assumed for recognition, although Hu moments are used which are scale and rotation invariant. These are used if the shape mask fails, which is most likely to happen when the shapes mismatch.

u and n be are distinguished assuming that rotation is not more than 90 degrees about the vertical on either side.

The program fails most often by confusing u and n. It also misses vowels at times and picks up consonants like w etc. The most important reason for failure is the insubstantial nature of the training data. Also noise and segmentation often causes drastic changes in areas and moments. At time the segmentation cannot separate two letters which are very close by. Also the euler number is not always preserved. The size of letters is also another problem. Since the characters are very small, the noise affects them a lot. Larger characters may be easier to recognize because noise would only affecting them at their boundaries and hence descriptors (like moments and areas) would be more stable.

The moments and areas as descriptors are pretty general. To do general purpose OCR, the letters can first be separated into classes based on euler numbers (with it being possible for a letter to appear in more than one class). Discarding the possibility of text being rotated can be helpful by allowing greater use of morphology and shape masking.

The problem gave an idea of practical vision applications, and the level of 'hacks' involved in them. It also helped develop insights on how to develop general recognition frameworks, and deal with recognition problems in general. The problems due to noise and discretization of images also played havoc. Overall, it was fun.


These Images show vowels marked in red.

Accuracy : 25/26

Accuracy : 25/27

Accuracy : 102/131