Automatic Hindi OCR Error Correction Using MLM-BERT

ABSTRACT


INTRODUCTION
Hindi, being a major regional language in India with over 500 million speakers worldwide, requires an accurate Optical Character Recognition (OCR) solution to address automatic document analysis problems.OCR refers to the identification of text from digitally scanned documents.Preserving the meaning of the text and the content can effectively be utilized for natural language applications.In the context of Indic languages such as Hindi, the accuracy of OCR output is often low and requires additional post-processing to refine it further.Hindi poses several challenges for OCR systems.These challenges include the complexity of the scripts (complex syllables formed by various combinations of vowels, consonants and conjunct consonants), the highly inflectional nature of the language [1], and the errors that can occur within the context of a sentence.In some cases, an incorrectly recognized word may still be valid in the same language, which can further complicate the task of OCR.Moreover, due to the poor availability of language resources, having good accuracy can make it difficult to train the systems.
Currently, available OCR for Hindi are Tesseract OCR [2], Indsenz OCR which is a commercial product [3], eAksharaya OCR [4] and Multilingual OCR for Indian languages [5].Among these most reasonable OCRs are Tesseract OCR and Multilingual OCR for Indian languages.They showed around 89.29% and 91.6% word accuracy on the available Hindi OCR test dataset from IIITH, respectively [5][6][7].The word accuracy achieved by them needs substantial improvement, which can be achieved through post-processing of the OCR output.
Recent post-processing approaches that have been utilized to improve the accuracy of Hindi OCR are dictionary-based [8], statistical language models [8,9], Syllable and confusion matrix based [7], Deep learning LSTM [10], Word Embedding & Levenshtein distance [11] and sub-word embedding [12].However, these approaches do not address contextual errors.
To address this limitation, we propose a context-sensitive automatic error correction approach to improve the accuracy of Hindi OCR output.In this work, the goal is to correct errors found in scanned documents generated via Tesseract Hindi OCR system, specifically focusing on providing suggestions for incorrect words based on contextual information, as Tesseract is a freely available OCR providing reasonably good accuracy.
The paper is organized as follows: Section 1 briefly introduces challenges of Hindi OCR, emphasizes the significance of post-processing and briefly introduces the proposed error correction approach.Section 2 discusses the relevant existing work on OCR error correction approaches.The motivation behind the current work is outlined in section 3. Section 4 presents the methodology for developing an automatic error correction model for Hindi OCR.The experimental results and error analysis are presented in section 5. Lastly, the conclusion and future scope are mentioned in section 6.

RELATED WORK
Error correction is a necessary step to improve the accuracy of the OCR output text.This section discusses the various OCR error correction approaches used for post-processing the OCR output for Indian languages.
Post-processing techniques have been extensively used in the past to enhance OCR accuracy.Various OCR postprocessing techniques are explored by Nguyen et al. [13].These techniques can be grouped based on usage of (i) Dictionaries or Lexicon (ii) Syntactic or Semantic rules of the language/word morphology (iii) Statistical Language Models usually represented as n-grams etc., and (iv) Deep Learning approach.Some of the methods find the best alternative, while other methods find the top-n possible alternatives.Dictionary-based approach on the Hindi OCR recognized document showed a character accuracy of 93% [14] The problem with the dictionary-based approach is that it assumes the word suggested as true if it is present in the dictionary and also it cannot correct the contextual errors.
Syntactic and semantic-based approaches for error correction can be found here.A Morphological analyzer was utilized for Bangla (Indian script) to find the root words and provide correct suggestions [15].The corrections made here are based on a fast dictionary access method and suggest the correct word in 82.44% of cases.
The shape-based post-processing technique was applied to Gurmukhi (Indian script) and achieved a 3% improvement, attending to 97.34% character accuracy [16].Here, consonants are categorized into groups based on shape similarity, each assigned a unique number.The input word is encoded accordingly.If the code is found in the dictionary, a matching operation is executed between the input word and words stored under that specific code.In cases without a match, structurally similar words to the input are suggested as alternatives.This approach finds difficulty in providing suggestions to similar characters.
Locality Sensitive Hashing (LSH) was used to create word clusters to improve the Telugu language OCR accuracy [17], resulting in 79.12% word accuracy.Here the OCR outputs of word images within a cluster are compared, and improvements are achieved through methods such as Character Majority Voting or Dynamic Time Warping Technique.This technique fails to provide suggestions for unique words (words appearing only once in a cluster).
The Weighted Syllable-based Spell Correction Approach (WSSC) was applied to Hindi OCR-recognized text, resulting in 88.22% word accuracy [7].In this method, a syllable confusion matrix was generated, taking into account the frequency of each syllable substitution.Word was looked up in the dictionary to check whether it is correct or incorrect.If the word is incorrect, then based on the confusion matrix, appropriate syllables are substituted for all syllables in the words.Forming many words and assigning the weight to words based on the frequency of syllables.Again, these words were looked up in the dictionary for their correct combination, suggesting the best word based on the weight assigned.It will not suggest contextual suggestions.All these OCR postprocessors implement post-processing at the Unicode (or character) level.
A graph-based, sub-character level language model was applied to rectify OCR word errors in Malayalam [18].In this approach, edges represent the language information, and nodes represent visual similarities.The optimal path from source to destination provides suggested words.Through this approach word accuracy of 95.0% is achieved [18].This approach fails to provide suggestions for rare words, proper nouns and foreign words.
Statistical language models (SLM) and Dictionary-based methods have been used in error correction for Indian languages such as Hindi, Punjabi (Gurmukhi script), Telugu, and Malayalam [9].This approach reported Improvements in errors in ranges between 38% to 66% for the above languages.The limitation of this work is that a dictionary and character ngram alone cannot effectively suggest the correct word due to the inflectional nature of Indian languages.
Sub-word embeddings have also been applied for OCR error correction in Hindi, Sanskrit, Kannada, and Malayalam, resulting in 90.42% word accuracy for Hindi [12].Further intersection of error rectification and cluster analysis was observed in the clustering of similar words within documents for error correction, particularly in Hindi and English [8].This approach lacks contextual suggestion.
Spelling correction in OCR-generated Hindi text, utilizing methods like Word Embedding and Levenshtein Distance [11].This uses the Continuous Bag-of-Words (CBOW) model.It does not consider the order of words in a given context.Thus, it will struggle to provide contextual suggestions for incorrect words.
Recently, deep learning techniques were used for OCR error detection and correction spanning four languages-Hindi, Malayalam, Sanskrit, and Kannada-specifically an LSTM approach, particularly relevant when languages are written in distinct scripts [10].It showed 92% of word accuracy for Hindi.LSTMs can capture dependencies over longer sequences than simple RNNs, but they still have a limited context window.
The above literature review indicates that post-processing of OCR output for Indian languages is challenging due to context-dependent errors and the inherent nature of highly inflectional languages.The abundance of words due to the inflectional nature of the Indian language also poses a challenge in OCR error correction.

MOTIVATION
Literature reviews have shown that there is still scope for improving the OCR output of Indian languages.The review of the existing literature has indicated that post-processing in OCR output is crucial in enhancing the performance of Hindi OCR.However, the most commonly used approaches do not address context-sensitive error correction, as indicated by the three examples shown in Table 1.The red color indicates an incorrect word.Example 1 where the incorrect word is क ां .Here, the existing bigram model can show suggestions as क , के , की, etc.Likewise, in example 2, रनप is an incorrect word.Existing models can suggest a word like ब द, स थ, लिए etc.Similarly, in example 3 the incorrect word is एफटीआहँ आई, which is an English foreign word.For such cases, we need an approach that can suggest a word based on context.
This can be achieved using state-of-the-art techniques such as Masked Language Model (MLM) with Bidirectional Encoder Representations from Transformers (BERT), which have not yet been explored for OCR error correction in Indian languages.The MLM BERT model has been utilized for tasks such as detection [19][20][21] and correction [22], demonstrating its ability to simplify the model's architecture and provide contextual suggestions [23].MLM BERT masks words containing errors and offers suggestions for the masked word, based on available contextual information.To build the MLM BERT for Hindi, it is necessary to have a BERT tokenizer and MLM BERT model for Hindi.In this paper, an OCR error correction approach using an advanced MLM with BERT to suggest contextually aware correction is discussed in section 4.

METHODOLOGY
This section discusses two main parts of the experimental setup.The first part describes how the dataset was created for testing the error correction model.The second part provides details of the experimental setup.

Test dataset creation
The procedure to create datasets to evaluate the error correction model is discussed here.Two sets of test data were utilized to evaluate the performance of the proposed error correction model.The first dataset is the Hindi OCR dataset obtained from IIITH [5].It comprises a total of 100 pages presented in image format, along with their respective text files.For the sake of convenience, we have named this dataset as Hindi-IIITH-dataset.This dataset showed 91.6% word accuracy on OCR for Indian languages [5] and 89.29% on Tesseract Hindi OCR [7].The second dataset is created from a part of Hindi text within the same domain, which has been utilized for training in this work.The total number of words available in the Hindi-IIITH-dataset is 29884.The characteristics of this dataset is presented in Table 2 [25]) e.g.पल्लव (PaLLaVa).
Words with multiple modifiers: Words which have more than one modifier Ookar or Anusvara or Eekar and does not belong to any of the other categories e.g.मलिक (manIKA), लनत र (NITARA).

Experimental setup for error correction approach
BERT originally developed for two main tasks: next sentence prediction and MLM to predict the masked word in the sentence.The word to be predicted is masked with a token [MASK] and the model generates contextual suggestions for the masked word [30].In this proposed error correction approach, we initially mask the incorrect word, and further, the proposed error correction approach suggests the contextually appropriate word for the masked word.Figure 1  A detailed explanation is provided for each phase involved in the process: Ⅰ. Training BERT tokenizer: The BERT model requires the input data in a suitable format to obtain the corresponding embedding of the sentence.The input to the BERT model is a sentence of a fixed length.Each token was given a distinct ID when the BERT model was trained.So, in order to employ a pre-trained BERT model, each token in the input sentence must first be converted into its corresponding unique ID.Sentences are padded to achieve a fixed length for input.To obtain a unique ID for a token, a BERT tokenizer is needed.The BERT tokenizer will split the word into two subwords (token).The first token is a frequently seen word (prefix) in a corpus.Two hashes are prefixed to the second token to indicate that it is a suffix following some other subwords.Further, we can convert subwords to unique IDs using the BERT tokenizer.The BERT tokenizer for Hindi has been created using BertWordPeiceTokenizer and the respective data which is shown in Table 3. Ⅱ. Training MLM BERT model: Here the BertForMaskedLM architecture is employed specifically utilizing the BERT Model known as 'bert-base uncased'.Here we have chosen 'bert-base uncased' as Devanagari is an uncased script.This particular model consists of 12 layers, 768 hidden units, 12 attention heads, and a total of 110 million parameters.It was trained on lowercase English text.Despite utilizing this model, it was challenging to achieve satisfactory accuracy in correcting OCR errors in Hindi text.To support Hindi text, the BERT Dev-hindi tokenizer as created in Phase I, is used and a corpus for additional training of the pre-trained BERT model "bert-base-uncased" is also employed, which is the same as the one used during Phase I for constructing the tokenizer.The training parameters were configured as follows: Gaussian Error Linear Unit (GELU) as hidden activation function, the maximum position embeddings were limited to 512, the attention dropout rate was set to 0.1, the number of epochs was set to 3, the instantaneous batch size per device was 8, the gradient accumulation steps were set to 1, the number of sentences was 539134 and the total number of optimization steps was 50500.The GELU was chosen as an activation function as it is a high-performing neural network activation function [31].The choice of dropout rate and epoch configuration was influenced by a study demonstrating good performance results in the context of BERT [32].After training the training loss was 1.54.Ⅲ. Error Correction using MLM BERT: In this module, a lookup dictionary is utilized to identify incorrect words and the correction model suggests the appropriate word for every incorrect word.All the incorrect words were masked using [MASK] token.The word correction was achieved by utilizing the "Dev-Hindi" model and its corresponding tokenizer i.e. "Dev-Hindi".Initially, the input OCR text having masked token is tokenized using the "Dev-Hindi" tokenizer, resulting in token IDs, and the"Dev-Hindi" model generates a set of predictions for the word in the form of token IDs and their associated probabilities.These token IDs are then converted back to words using the "Dev-Hindi" tokenizer.Employing the Levenshtein distance for the top 6 candidate words, the similarity between the original token and the predicted token was assessed by selecting those having a similarity score of more than 0.8.Ultimately, the word with the highest probability among the suggested options is selected by the system as the most suitable word.An example is shown in Table 4.The word selected by the system can be either correct or incorrect.In some cases, the system may not provide any suggestions, and in those instances, the word remains unchanged.Examples for each of these cases are discussed in section 5.The correct words are indicated in green, the incorrect words are underlined and marked in red and the words for which no suggestions were provided are indicated in blue.

Performance of Hindi_Error_Correction_Model on Hin_test_dataset and Hindi-IIITH-dataset
The proposed error correction approach has been evaluated with Hin_test_dataset and Hindi-IIITH-dataset [5], the characteristics of which are detailed in Table 2.The ISRI analytic tool was used to calculate the word accuracy during the evaluation process [33].The following evaluation metrics are used: (i) Accuracy (word accuracy): Word accuracy is the percentage of the words recognized correctly from the total words in the ground truth dataset.
(ii) Word Error Rate (WER): WER is the percentage of words incorrectly recognized from the total words in the ground truth dataset.It is also equal to 100 minus word accuracy.

Dataset
Word Accuracy Tesseract OCR (Hindi) Hindi_Error_ Correction_Model Hin_test_dataset 87.23% 89.65% Hindi-IIITH-dataset [5] 89.29% 95.18%It was observed that Tesseract Hindi OCR word accuracy on Hin_test_dataset and Hindi-IIITH-dataset were 87.23% and 89.29%, respectively.On performing error correction using Hindi_Error_Correction_Model on the dataset Hin_test_dataset the word accuracy increased from 87.23% to 89.65%.Additionally, when conducting error correction on the Hindi-IIITH-dataset, using the Hindi_Error_Correction_Model, it was observed that word accuracy increased from 89.29% and 95.18%.as shown in Table 5.Therefore, it can be observed that there is an improvement in word accuracy on both testing dataset Hin_test_dataset and Hindi-IIITH-dataset.The difference in word accuracy between the Tesseract OCR results obtained from the Hin_test_dataset and the Hindi-IIITH-dataset [5] has been noticed.It observed the increase in word accuracy from 91.60% to 95.18%, having an improvement of 3.58% over the baseline Hindi OCR word accuracy [5].
Table 6 shows the category-wise WER for the output of the Tesseract OCR and Hindi_Error_Correction_Model.It is observed that Hindi-IIITH-dataset had more errors in words without any modifiers.This is because of the existence of punctuation mark "।" in the Hindi-IIITH-dataset which is constantly misclassified by Tesseract OCR output.Our proposed error correction approach corrects it, indicating proper punctuation marks for all the instances of the sentence.Thus, the Hindi_Error_Correction_Model corrects most of the words that have no modifiers.
Further, it is observed that there is a decrease in error for words with Ekar, Ookar, Conjunct consonants, and words with multiple modifiers.Words with anusvara error rate have remained the same.However it is important to note that these error words were masked and the Hindi_Error_Correction_Model's correction is based more on contextual information.

Error analysis
The results of this research work indicate that the proposed model delivers accurate recommendations, as demonstrated by the data presented in Table 7.For example, the actual word क (KO) is recognized as क ां (KAM) via tesseract OCR systems, whereas the Hindi_Error_Correction_Model corrects it to क (KO).

Table 7. Performance of Hin_test_dataset and Hindi-IIITHdataset
Actual word

Hindi_Error_Correction_
Let us consider another example to illustrate the effectiveness of the Hindi_Error_Correction_Model: The actual word बध ई (BaDhAI) is recognized by OCR as व ई (VAI), while the correction provided by the suggested model is बध ई.Additionally, the Hindi_Error_Correction_Model exhibits contextual-based corrections, as evidenced by the information provided in Table 8 and Table 9.For example, the actual text 6 हज र (6 HaJaRa) was incorrectly recognized as ह हज र (Ha HaJARa), and the proposed model suggests 10 हज र (HaJARa).However, the suggested word is incorrect.It is interesting to note that the suggested word is number and not any other word close to ह (Ha).
The Hindi_Error_Correction_Model also provides incorrect suggestions including cases where incorrect suggestions are given based on the context, as illustrated in Table 10.However, the suggested word is very similar to the correct word or contextually appropriate.Here is an example of the observation, i.e., the actual word लवच र ां (VICAROM) and OCR recognized word as लवच र (VICARO) and Hindi_Error_Correction_Model suggestion is लवच र (VICARO) which is a quite similar suggestion to the actual word.Another example of a similar suggestion is with actual word उद् घ टन (UDGhATaNa), OCR-recognized word उद्ध टन (UDGhaTaNa) and proposed model suggestion is उदघ टन (UDaGhATaNa).Furthermore, an instance that demonstrates contextualbased recommendations is when real words is इसे (ISE), OCR recognised word is हरने (HaRaNE) and model suggested the word as मह मलहम (maHAmaHIma), where the system suggested word was suitable for the sentence as shown in Table 11.And cases where suggestions were not offered for words are depicted in Table 12 and Table 13.For example, having actual word लहस्स (HISSA) and OCR recognized word लिरुस (DVIRUSA), the proposed model did not provide any suggestion for word लिरुस (DVIRUSA), result in retaining same word.Thus, it is observed that Hindi_Error_Correction_Model fails for numbers and mostly words whose first character of word is incorrectly recognized.Additionally, when the majority of neighboring words were incorrect, the Hindi_Error_Correction_Model fails to provide appropriate suggestions, as shown in Table 14.Nevertheless, the incorrect suggestions often align closely with the actual word or most likely based on context.It was also observed that in some cases, such as numbers, the MLM BERT model fails to provide appropriate suggestions.The model does not provide any suggestions especially when there exist two or more incorrect consecutive words.Also if most of the neighboring words are incorrect with very low edit distance measure then the model is not able to provide proper suggestions.Moving forward, future work will prioritize improvements through ensemble approaches and exploring better language models for Hindi OCR error correction.Also MLM BERT with other similarity functions can be explored.
describes the proposed system, which consists of three modules: I) Training BERT tokenizer, II) Training MLM BERT model, and III) Error Correction using MLM BERT.

Figure 1 .
Figure 1.Proposed hindi error correction systemTable 3. Data used for Training Dev-Hindi tokenizer and Hindi_Error_Correction_Model Language Total Words Source of Data

Table 2 .
Characteristics of data used to create dataset to test error correction model

Table 4 .
Example of Hindi_Error_Correction_Model output Hindi language

Table 8 .
Output of Hindi_Error_Correction_Model indicating suitable output as per context

Table 9 .
Output of Hindi_Error_Correction_Model showing correct suggestion

Table 10 .
Output of Hindi_Error_Correction_Model having

Table 13 .
Examples of Hindi_Error_Correction_Model output providing no suggestion

Table 14 .
Output of Hindi_Error_Correction_Model with incorrect suggestionIn this research paper, the problem of automatic error correction of OCR output of highly inflectional Hindi language is addressed.Error correction model utilizing stateof-art Masked Language Modeling (MLM) with BERT is proposed.The model selects the best word among the top 6 candidates based on assigned probabilities.The proposed automatic error correction model improved word accuracy by 3.58% over Tesseract OCR.There is more improvement in words without any modifiers.And negligible improvement for the words consisting of Ekar, Ookar, Conjunct consonant and words with multiple modifiers.Model outperforms its counterparts by providing context-sensitive suggestions.