This paper presents an experimental evaluation of different features for speaker identification (SID). The features are tested on speech data from the EUROM1 database in a text-independent, closed-set speaker identification task. The main objective of the paper is to present a novel parameterization of speech based on the Auditory Image Model (AIM), an auditory model whose output is used to derive features whose utility is assessed for speaker identification. To explore which features are most informative for predicting a speaker's identity, the auditory image is divided into rectangular regions from which the features are derived. A novel enrolment strategy is then introduced to specify the regions of the image whose features make a speaker discriminative. The resulting speaker-specific feature representation is assessed in noisy conditions that simulate a real-world environment, and its performance is compared with that of MFCC features within a Vector Quantization (VQ) classification system. The identification-accuracy results suggest that the new parameterization outperforms conventional MFCCs, especially at low SNRs.
Keywords: Auditory image model, Speaker identification, Feature extraction
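Since the abstract only outlines the pipeline at a high level, the following is a minimal, self-contained sketch of the two ingredients it describes: cutting an auditory image into rectangles to obtain a feature vector, and closed-set identification with per-speaker VQ codebooks. It is illustrative rather than the authors' implementation: the grid size, codebook size, the mean-activation summary, and the function names (rectangle_features, train_codebook, identify) are all assumptions, and random arrays stand in for real AIM output.

```python
import numpy as np

def rectangle_features(auditory_image, n_rows=4, n_cols=8):
    """Cut a 2-D auditory image (frequency channels x time-interval lags)
    into an n_rows x n_cols grid and summarise each rectangle by its
    mean activation, giving one feature per rectangle.
    NOTE: the grid size and the mean summary are illustrative choices."""
    h, w = auditory_image.shape
    feats = [auditory_image[r * h // n_rows:(r + 1) * h // n_rows,
                            c * w // n_cols:(c + 1) * w // n_cols].mean()
             for r in range(n_rows) for c in range(n_cols)]
    return np.asarray(feats)

def train_codebook(features, k=16, iters=20, seed=0):
    """Train a speaker codebook with plain k-means (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each feature vector to its nearest codeword, then refit.
        labels = np.linalg.norm(features[:, None] - codebook[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook

def identify(test_features, codebooks):
    """Closed-set identification: return the enrolled speaker whose
    codebook yields the lowest average quantisation distortion."""
    def distortion(cb):
        return np.linalg.norm(test_features[:, None] - cb[None], axis=2).min(axis=1).mean()
    return min(codebooks, key=lambda spk: distortion(codebooks[spk]))

# Toy usage: random stand-ins for auditory-image frames of two speakers.
rng = np.random.default_rng(1)
frames = {spk: [np.abs(rng.standard_normal((50, 400))) + bias for _ in range(40)]
          for spk, bias in [("spk_A", 0.0), ("spk_B", 0.5)]}
codebooks = {spk: train_codebook(np.stack([rectangle_features(f) for f in fr]))
             for spk, fr in frames.items()}
test = np.stack([rectangle_features(np.abs(rng.standard_normal((50, 400))) + 0.5)
                 for _ in range(10)])
print(identify(test, codebooks))  # expected: spk_B
```

In the paper's enrolment strategy, only the rectangles found to be discriminative for a given speaker would be retained; the sketch above keeps all rectangles for simplicity.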