Une méthode dirigée par la syntaxe pour l’extraction de champs numériques dans les courriers entrants

A syntax directed method for numerical field extraction in incoming mail documents

Clément Chatelain, Guillaume Koch, Laurent Heutte, Thierry Paquet

Laboratoire PSI, Université de Rouen, 76821 Mont-Saint-Aignan Cedex

Corresponding Author Email: clement.chatelain@univ-rouen.fr

Pages: 179-198

Received: 28 February 2006

Accepted: N/A

OPEN ACCESS

Abstract: 

In this article, we propose a generic method for automatically locating and recognising numerical fields (phone numbers, ZIP codes, etc.) in unconstrained handwritten incoming mail documents. The method exploits the syntax of a numerical field as a priori knowledge to locate it in the document. A syntactic analyser based on Markov models keeps only the sequences of connected components that respect a syntax known to the system. Only the extracted fields are then submitted to a numeral recognition process. We thus avoid full recognition of the document, which is a difficult and time-consuming task. We demonstrate the effectiveness of the method on a database of real incoming mail documents.

Résumé

Dans cet article, nous présentons une méthode générique d'extraction et de reconnaissance de champs numériques (numéro de téléphone, code postal, etc.) dans des courriers manuscrits non contraints. La méthode d'extraction exploite la syntaxe des champs comme information a priori pour les localiser. Un analyseur syntaxique à base de modèles de Markov filtre les séquences de composantes qui respectent la syntaxe d'un type de champ connu du système. Notre approche permet ainsi d'éviter la reconnaissance totale du document, opération délicate et coûteuse en temps de calcul, puisque seuls les champs localisés sont soumis à un système de reconnaissance. Nous montrons l'efficacité de la méthode sur une base de courriers manuscrits réels de type courrier entrant.

Keywords: 

Handwriting recognition, information extraction, numerical fields, neural networks, Markov models

Mots clés

Reconnaissance de l'écriture manuscrite, extraction d'information, champ numérique, réseaux de neurones, modèles de Markov
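
The syntactic filtering summarised in the abstract can be illustrated with a small sketch. This is not the authors' model: it assumes a simplified two-class component classifier (`digit` / `reject`), a fixed-length field such as a 5-digit ZIP code, uniform transition probabilities, and standard Viterbi decoding to align a line of connected components with the field syntax. The function name `locate_field` and the input format are hypothetical.

```python
import math

def locate_field(class_probs, n_digits=5):
    """Align a line of components with the syntax: reject* digit{n} reject*.

    class_probs: one dict per connected component, with the (assumed)
    classifier outputs {"digit": p, "reject": 1 - p}.
    Returns (indices of components in the field, log-score) or None.
    """
    # States: 0 = text before the field, 1..n_digits = the digits,
    # n_digits + 1 = text after the field.
    n_states = n_digits + 2
    last_digit, post = n_digits, n_digits + 1
    T = len(class_probs)
    if T < n_digits:
        return None  # the line is too short to contain the field

    def predecessors(s):
        if s == 0:
            return [0]                   # "before" loops on itself
        if s == post:
            return [last_digit, post]    # "after" follows the last digit or loops
        return [s - 1]                   # each digit follows the previous state

    def emission(s, probs):
        key = "reject" if s in (0, post) else "digit"
        return math.log(max(probs[key], 1e-12))

    NEG = float("-inf")
    score = [[NEG] * n_states for _ in range(T)]
    back = [[None] * n_states for _ in range(T)]
    # A line may start with surrounding text or directly with the field.
    score[0][0] = emission(0, class_probs[0])
    score[0][1] = emission(1, class_probs[0])

    for t in range(1, T):
        for s in range(n_states):
            prev = max(predecessors(s), key=lambda p: score[t - 1][p])
            if score[t - 1][prev] == NEG:
                continue                 # state not reachable at this position
            score[t][s] = score[t - 1][prev] + emission(s, class_probs[t])
            back[t][s] = prev

    # A valid path must have consumed the whole field.
    end = max((last_digit, post), key=lambda s: score[T - 1][s])
    if score[T - 1][end] == NEG:
        return None

    # Backtrack to recover the state assigned to each component.
    path = [end]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    field = [t for t, s in enumerate(path) if 1 <= s <= last_digit]
    return field, score[T - 1][end]
```

Only the components returned in `field` would then be passed to a numeral recogniser, which is the point made in the abstract: the rest of the line is never recognised. A fuller model would add separator states, transition probabilities, and handling of touching digits, none of which are attempted here.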

1. Introduction
2. Problem Analysis and Justification of the Method
3. Problem Formalisation
4. General Description of the Method
5. Line Segmentation
6. Component Classification
7. Syntactic Analyser
8. Field Recognition
9. Results
10. Conclusion and Perspectives