Non Linear and Discriminant Feature Extraction Applied to Phonemes Recognition

Extraction de Caractéristiques Non Linéaire et Discriminante: Application À la Classification de Phonèmes

Bruno Gas Mohamed Chetouani  Jean Luc Zarader 

Université Pierre et Marie Curie-Paris 6, Groupe Perception et Réseaux Connexionnistes, EA 2385, Ivry sur Seine, F-94200 France

Pages: 39-58 | Received: 1 October 2005 | Accepted: N/A | Published: 28 February 2007

OPEN ACCESS

Abstract: 

In this article we propose a speech coding method applied to the recognition of phonemes. The proposed model, the Neural Predictive Coding (NPC), together with its two variants (NPC-2 and DFE-NPC), is a connectionist model (a multilayer perceptron) based on the nonlinear prediction of the speech signal. We show that the discriminant capacity of such an encoder can be improved by introducing the signals' class membership information from the coding stage onwards. As such, it belongs to the category of Discriminant Feature Extraction (DFE) encoders already proposed in the literature. In this study we present a theoretical validation of the model under the hypotheses of noise-free signals and signals corrupted by Gaussian noise. NPC performance is compared with that obtained by traditional speech processing methods on the Darpa Timit and Ntimit speech databases. The simulations presented here show that classification rates are clearly improved compared to the usual methods, in particular for phonemes considered difficult to process. A small-vocabulary word recognition experiment is provided to show how NPC features can be used in a more conventional ANN-HMM based speech recognition system.
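To make the predictive-coding idea concrete, here is a minimal sketch under our own assumptions (not the authors' NPC implementation): a small multilayer perceptron is trained to predict each speech sample from its p predecessors, and the fitted output-layer weights are taken as the feature vector of the analysed frame. All function names and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_to_patterns(frame, p):
    """Build (input, target) pairs: predict x[n] from x[n-p..n-1]."""
    X = np.array([frame[i:i + p] for i in range(len(frame) - p)])
    y = frame[p:]
    return X, y

def npc_features(frame, p=8, hidden=4, epochs=200, lr=0.05):
    """Fit a one-hidden-layer perceptron predictor by gradient descent
    on the mean squared prediction error, and return its output-layer
    weights as the frame's feature vector (the NPC-style 'code')."""
    X, y = frame_to_patterns(frame, p)
    W1 = rng.normal(scale=0.1, size=(p, hidden))   # input -> hidden weights
    w2 = rng.normal(scale=0.1, size=hidden)        # hidden -> output weights
    for _ in range(epochs):
        H = np.tanh(X @ W1)                        # hidden activations
        e = H @ w2 - y                             # prediction error per sample
        grad_w2 = H.T @ e / len(y)
        grad_W1 = X.T @ (np.outer(e, w2) * (1 - H ** 2)) / len(y)
        w2 -= lr * grad_w2
        W1 -= lr * grad_W1
    return w2                                      # one feature vector per frame

# Toy usage on a synthetic "voiced" frame (a noisy sinusoid):
n = np.arange(400)
frame = np.sin(2 * np.pi * 0.03 * n) + 0.01 * rng.normal(size=n.size)
feats = npc_features(frame)
print(feats.shape)  # prints (4,)
```

The discriminant variants described in the article (NPC-2, DFE-NPC) go further by injecting class membership information into the coding cost itself; the sketch above only shows the plain predictive stage.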

Résumé

We propose in this article a new feature extraction method applied to phoneme recognition. The proposed model, Neural Predictive Coding (NPC), together with its two variants NPC-2 and DFE-NPC (Discriminant Feature Extraction NPC), is a connectionist model of the multilayer perceptron (MLP) type based on the nonlinear prediction of the speech signal. We show that the discriminant capacity of such a coder can be improved by exploiting the phonetic class membership information of the signals from the analysis stage onwards. As such, it belongs to the category of DFE extractors already proposed in the literature. In this study we present a theoretical validation of the model under the hypotheses of noise-free and noisy (additive Gaussian noise) signals. The performance of the NPC extractor for phoneme classification is compared with that obtained by the methods traditionally used for feature extraction on signals from the Darpa Timit and Ntimit databases. The simulations presented show that recognition rates are clearly improved, in particular for English phonemes that are frequent but considered difficult to categorise. Finally, a small-vocabulary isolated word recognition application is presented in order to show how NPC parameters can be inserted into a recognition application using a hybrid ANN-HMM (Artificial Neural Networks – Hidden Markov Models) system.

Keywords: 

Speech feature extraction, predictive neural networks, nonlinear signal processing, phoneme recognition.

1. Introduction
2. State of the Art in Feature Extraction for Classification
3. Neural Predictive Coding: the NPC Model
4. Discriminant Feature Extraction
5. Analysis of the Models
6. Conclusion
  References

[1] University of Pennsylvania Linguistic Data Consortium. The NIST DARPA-TIMIT acoustic-phonetic continuous speech corpus: a multi-speaker database, 1990. 

[2] K. AIKAWA, H. SINGER, H. KAWAHARA, and Y. TOHKURA, A dynamic cepstrum incorporating time-frequency masking and its application to continuous speech recognition. In International Conference on Speech and Signal Processing (ICASSP), volume 2, pp. 668-671, 1993. 

[3] B.S. ATAL and S. L. HANAUER, Speech analysis and synthesis by linear prediction of the speech wave. Journal of the Acoustical Society of America, 50:637-655, 1971. 

[4] M. BACCHIANI and K. AIKAWA, Optimization of time-frequency masking filters using the minimum classification error criterion. In International Conference on Speech and Signal Processing (ICASSP), volume 2, pp. 197-200, 1994. 

[5] A. BIEM, Neural models for extracting speaker characteristics in speech modelization system. PhD thesis, Paris VI, 1997. 

[6] A. BIEM and S. KATAGIRI, Feature extraction based on minimum classification error/generalized probabilistic descent method. In International Conference on Speech and Signal Processing (ICASSP), volume 2, pp. 275-278, 1993. 

[7] A. BIEM and S. KATAGIRI, Filter bank design based on discriminative feature extraction. In International Conference on Speech and Signal Processing (ICASSP), volume 1, pp. 485-488, 1994. 

[8] M. BIRGMEIER, Nonlinear prediction of speech signals using radial basis function networks. In Proceedings of EUSIPCO'96, pp. 459-462, September 1996. 

[9] C. M. BISHOP, Novelty detection and neural network validation. In IEE proceedings: Vision, Image and Signal Processing. Special Issue on applications of neural networks, volume 141, pp. 217-222, 1994. 

[10] C. M. BISHOP, Neural Networks for Pattern Recognition. Clarendon Press - Oxford, 1995. 

[11] H. BOURLARD, H. HERMANSKY, and N. MORGAN, Towards increasing speech recognition error rates. Speech Communication, 18:205-231, 1996. 

[12] H. BOURLARD and N. MORGAN, Hybrid HMM/ANN systems for speech recognition: overview and new research directions. Lecture Notes In Computer Science, 1387:389-417, 1997. 

[13] H. BOURLARD and Y. KAMP, Auto-association by multilayer perceptron and singular value decomposition. Biological Cybernetics, 59:291-294, 1988. 

[14] T. BURROWS, Speech Processing with Linear and Neural Network Models. PhD thesis, Cambridge University, 1996. 

[15] M. CHETOUANI, B. GAS, and J. L. ZARADER, Maximisation of the modelisation error ratio for neural predictive coding. In NOLISP'03 (ISCA Tutorial and Research Workshop on Non-Linear Speech Processing), pp. 77-80, 2003. 

[16] P. CHEVALIER, P. DUVAUT, and B. PICINBONO, Le filtrage de Volterra transverse réel et complexe en traitement du signal. Traitement du Signal, 7(5):451-476, 1990. 

[17] S. B. DAVIS and P. MERMELSTEIN, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4):357-376, 1980. 

[18] A. DE LA TORRE, A. M. PEINADO, A. J. RUBIO, V. E. SÁNCHEZ, and J. E. DÍAZ, An application of minimum classification error to feature space transformations for speech recognition. Speech Communication, 20:273-290, 1996. 

[19] F. DÍAZ-DE-MARÍA and A. R. FIGUEIRAS-VIDAL, Nonlinear prediction for speech coding using radial basis functions. In International Conference on Speech and Signal Processing (ICASSP), volume 1, pp. 788-791, 1995. 

[20] G. DREYFUS, O. MACCHI, S. MARCOS, O. NERRAND, L. PERSONNAZ, ROUSSEL-RAGOT, D. URBANI, and C. VIGNAT, Adaptive training of feedback neural networks for non linear filtering. Neural Networks for Signal Processing, 2:550-559, 1992. 

[21] M. FAUNDEZ-ZANUY and A. ESPOSITO, Nonlinear speech processing applied to speaker recognition. In Conf. on the Advent of Biometrics on the Internet, 2002. 

[22] S. FURUI, Speaker-independent isolated word recognition using dynamic features of speech spectrum. The Journal of the Acoustical Society of America, pp. 1738-1752, 1986. 

[23] B. GAS, J. L. ZARADER, and C. CHAVY, A new approach to speech coding: the neural predictive coding. Journal of Advanced Computational Intelligence, 4(1):120-127, 2000. 

[24] B. GAS, J. L. ZARADER, C. CHAVY, and M. CHETOUANI, Discriminant neural predictive coding applied to phoneme recognition. Neurocomputing, 56:141-166, 2004. 

[25] T. GAUTAMA, D. P. MANDIC, and M. M. VAN HULLE, On the characterisation of the deterministic/stochastic and linear/nonlinear nature of time series. Technical Report DPM-04-5, Imperial College London, 2004. 

[26] F. GIROSI and T. POGGIO, Representation properties of networks: Kolmogorov's theorem is irrelevant. Neural Computation, 1(4):465-469, 1989. 

[27] Y. GONG, Speech recognition in noisy environments: A survey. Speech Communication, 16:261-291, 1995. 

[28] R. HECHT-NIELSEN, Kolmogorov’s mapping neural network existence theorem. In Proceedings of the International Conference on Neural Networks, pp. 11-13, 1987.

[29] H. HERMANSKY, Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738-1752, 1990. 

[30] H. HERMANSKY, Should recognizers have ears? Speech Communication, 25:3-27, 1998. 

[31] H. HERMANSKY and N. MORGAN, RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2:587-589, 1994. 

[32] F. ITAKURA, Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustic, Speech and Signal Processing, 23:67-72, 1975. 

[33] F. ITAKURA and S. SAITO, Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, pp. 17-20, 1968. 

[34] B. H. JUANG and S. KATAGIRI, Discriminative learning for minimum error classification. IEEE Transactions on Signal Processing, 40(12):3043-3054, December 1992. 

[35] S. KARAJEKAR, Analysis of variability in Speech with Applications to Speech and Speaker Recognition. PhD thesis, OGI, Portland, USA, 2002. 

[36] S. KATAGIRI, Handbook of Neural Networks for Speech Processing. Artech House, 2000. 

[37] H. KATSUURA and D. A. SPRECHER, Computational aspects of Kolmogorov's superposition theorem. Neural Networks, 7(3):455-461, 1994. 

[38] T. KAWAHARA and S. DOSHITA, Phoneme recognition by combining discriminant analysis and HMM. In International Conference on Speech and Signal Processing (ICASSP), volume 1, pp. 557-560, 1991. 

[39] T. KAWAHARA, T. OGAWA, S. KITAZAWA, and S. DOSHITA, Phoneme recognition by combining Bayesian linear discriminations of selected pairs of classes. In International Conference on Speech and Signal Processing (ICASSP), p. 78, 1990. 

[40] G. I. KECHRIOTIS and E. S. MANOLAKOS, Using neural networks for nonlinear and chaotic signal processing. In International Conference on Speech and Signal Processing (ICASSP), volume 1, pp. 465-468, 1993. 

[41] D. J. KERSHAW, Phonetic Context-Dependency In a Hybrid ANN/HMM Speech Recognition System. PhD thesis, St. John's College, University of Cambridge, 1997.

[42] A. KOLMOGOROV, On the representation of continuous functions of many variables by superpositions of continuous functions of one variable and addition. Doklady Akademii Nauk USSR, 114(5):953-956, 1957. 

[43] G. KUBIN, Speech coding and synthesis, chapter Nonlinear processing of Speech, pp. 557-609. W.B. Kleijn and K.K. Paliwal Editors, Elsevier Science, 1995. 

[44] V. KURKOVA, Kolmogorov’s theorem and multilayer neural networks. Neural Networks, 5:501-506, 1992. 

[45] K. J. LANG, A. H. WAIBEL, and G.E. HINTON, A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43, 1990. 

[46] A. LAPEDES and R. FARBER, Nonlinear signal processing using neural networks: prediction and system modelling. Internal Report, Los Alamos National Laboratory, July 1987. 

[47] J. H. LEE, H. Y. JUNG, T. W. LEE, and S. Y. LEE, Speech feature extraction using independent component analysis. In International Conference on Speech and Signal Processing (ICASSP), volume 3, pp. 1631-1634, 2000. 

[48] J. C. LUCERO, A theoretical study of the hysteresis phenomenon at vocal fold oscillation onset-offset. Journal of the Acoustical Society of America, 1:423-431, 1999. 

[49] N. MA, T. NISHI, and G. WEI, On a code-excited nonlinear predictive speech coding (CENLP) by means of recurrent neural networks. IEICE Transactions on Fundamentals, special issue on digital signal processing, E81-A(8):1628-1634, 1998. 

[50] N. MA and G. WEI, Speech coding with nonlinear local prediction model. In International Conference on Speech and Signal Processing (ICASSP), volume 2, pp. 1101-1104, 1998. 

[51] P. M. MARAGOS and A. POTAMIANOS, Fractal dimensions of speech sounds: computation and application to automatic speech recognition. Journal of the Acoustical Society of America, 3:1925-1933, 1999. 

[52] P. J. MORENO and R. N. STERN, Sources of degradation of speech recognition in the telephone network. In International Conference on Speech and Signal Processing (ICASSP), volume 1, pp. 109-112, 1994. 

[53] A. PAGÈS-ZAMORA, M. A. LAGUNAS, M. NÁJAR, and A.I. PÉREZ-NEIRA, The k-filter: A new architecture to model and design non-linear systems from kolmogorov’s theorem. Signal Processing, 44:249-267, 1995. 

[54] D. POVEY, B. KINGSBURY, L. MANGU, G. SAON, H. SOLTAU, and G. ZWEIG, fMPE: discriminatively trained features for speech recognition. In Proc. of DARPA EARS RT-04 Workshop, 2004. 

[55] D. POVEY and P. C. WOODLAND, Minimum phone error and i-smoothing for improved discriminative training. In International Conference on Speech and Signal Processing (ICASSP), 2002. 

[56] V. C. RAYCAR, B. YEGNANARAYANA, and S. R. DURAISWAMY, Speaker localization using excitation source information in speech. IEEE Transactions on Speech and Audio Processing, 2004. 

[57] W. REICHL, S. HARENGEL, F. WOLFERSTETTER, and G. RUSKE, Neural networks for nonlinear discriminant analysis in continuous speech recognition. In Eurospeech, pp. 537-540, 1995. 

[58] G. SAON, M. PADMANABHAN, R. GOPINATH, and S. CHEN, Maximum likelihood discriminant feature spaces. In International Conference on Speech and Signal Processing (ICASSP), volume 2, pp. 1129-1132, 2000. 

[59] J. SCHOENTGEN, Non-linear signal representation and its application to the modelling of the glottal waveform. Speech Communication, 9:189-201, 1990. 

[60] J. SCHOENTGEN, On the bandwidth of a shaping function model of the phonatory excitation signal. In Non-Linear Speech Processing Workshop (NOLISP'03), 2003. 

[61] S. S. STEVENS and J. VOLKMANN, The relation of pitch to frequency: a revised scale. American Journal of Psychology, 53:329-353, 1940. 

[62] H. STRIK, Automatic parametrization of differentiated glottal flow: comparing methods by means of synthetic flow pulses. Journal of the Acoustical Society of America, 5:2659-2669, May 1998.

[63] H. M. TEAGER and S. M. TEAGER, Evidence for non linear sound production mechanisms in the vocal tract. Speech Production and Speech Modeling, 55:241-261, July 1989. 

[64] J. THEILER, S. EUBANK,A. LONGTIN, B. GALDRIKIAN, and J. FARMER, Testing for nonlinearity in time series: the method of surrogate data. Physica D, 58:77-94, 1992. 

[65] J. THYSSEN, H. NIELSEN, and S. D. HANSEN, Non-linear short-term prediction in speech coding. In International Conference on Speech and Signal Processing (ICASSP), volume 1, pp. 185-188, 1994. 

[66] B. TOWNSHEND, Non linear prediction of speech. In International Conference on Speech and Signal Processing (ICASSP), volume 1, pp. 425-428, 1991. 

[67] A. G. VITUSHKIN, On Hilbert's thirteenth problem. Dokl. Akad. Nauk. SSSR, 95:701-704, 1954. 

[68] A. H. WAIBEL, T. HANAZAWA, G. E. HINTON, K. SHIKANO, and K. J. LANG, Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328-339, March 1989. 

[69] A. R. WEBB, Functional approximation by feed-forward networks: a least square approach to generalisation. IEEE Transactions on Neural Networks, 5(3):363-371, 1994. 

[70] B. WIDROW, 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proc. of the IEEE, 78:1415-1442, 1990. 

[71] D. YUK and J. FLANAGAN, Telephone speech recognition using neural networks and hidden Markov models. In International Conference on Speech and Signal Processing (ICASSP), volume 1, pp. 157-160, 1999. 

[72] S. A. ZAHORIAN, D. QIAN, and A.J. JAGHARGHI, Acoustic-phonetic transformations for improved speaker-independent isolated word recognition. In International Conference on Speech and Signal Processing (ICASSP), volume 1, pp. 561-564, 1991. 

[73] E. ZWICKER and E. TERHARDT, Analytical expressions for critical band rate and critical bandwidth as a function of frequency. The Journal of the Acoustical Society of America, 68:1523-1525, 1980.