Choice and adaptation of statistical models for single channel singing voice separation.
Choix et adaptation de modèles statistiques pour la séparation de voix chantée à partir d’un seul microphone
Abstract
The problem of extracting the singing voice from mono audio recordings, i.e., single-microphone separation of voice and music, is studied. The approach is based on a priori probabilistic models of the two sources, more precisely on Gaussian Mixture Models (GMMs). A method for adapting the models to the characteristics of the mixed sources is developed, and a comparative study of different models and estimators is performed. We show that adapting the music model on the non-vocal parts of the songs yields good results under realistic conditions.
Résumé
Le problème de l’extraction de la voix chantée dans des enregistrements musicaux monophoniques, c’est-à-dire la séparation voix/musique avec un seul capteur, est étudié. Les approches utilisées sont basées sur des modèles statistiques a priori des deux sources (musique et voix), notamment sur des Modèles de Mélange de Gaussiennes (MMG). Une méthode d’adaptation des modèles aux caractéristiques des sources mélangées est proposée, et une étude comparative des différents modèles et estimateurs est effectuée. Les résultats montrent que l’adaptation du modèle de musique sur les parties non-vocales des chansons permet d’obtenir de bonnes performances dans un cadre réaliste.
Keywords
Single channel source separation, singing voice, statistical models, Gaussian mixture models, adaptive Wiener filtering, model adaptation.
Mots clés
Séparation de sources avec un seul capteur, voix chantée, modèles statistiques, modèles de mélange de gaussiennes, filtrage de Wiener adaptatif, adaptation de modèles.
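As a rough illustration of the separation principle summarised in the abstract (and not the authors' implementation), the following Python sketch applies adaptive Wiener filtering with GMM spectral priors for the two sources. The function name gmm_wiener_separate, the dictionary layout of the models, and all shapes are assumptions made for the example.

# Minimal sketch (not the authors' code) of adaptive Wiener filtering with
# GMM spectral priors. Each source (voice, music) is modelled by a GMM over
# short-time power spectra with diagonal covariances; every frame of the
# mixture is explained by every pair of states, and the per-pair Wiener
# gains are combined with the pair posteriors.
import numpy as np


def gmm_wiener_separate(X, voice_gmm, music_gmm, eps=1e-10):
    """Separate a mixture STFT X (freq x frames) into voice and music.

    voice_gmm / music_gmm: dicts with
        'weights': (K,)   mixture weights
        'psd':     (K, F) per-state power spectral densities (variances)
    Returns the estimated voice and music STFTs.
    """
    wv, Sv = voice_gmm["weights"], voice_gmm["psd"]   # (Kv,), (Kv, F)
    wm, Sm = music_gmm["weights"], music_gmm["psd"]   # (Km,), (Km, F)
    P = np.abs(X) ** 2                                # mixture power spectrogram

    # Total variance of the mixture for every (voice state, music state) pair.
    S_mix = Sv[:, None, :] + Sm[None, :, :] + eps     # (Kv, Km, F)

    # Log-likelihood of each frame under each state pair, assuming the STFT
    # coefficients are independent zero-mean complex Gaussians.
    loglik = -(np.log(np.pi * S_mix)[:, :, :, None]
               + P[None, None, :, :] / S_mix[:, :, :, None]).sum(axis=2)  # (Kv, Km, T)
    loglik += np.log(wv + eps)[:, None, None] + np.log(wm + eps)[None, :, None]

    # Posterior probability of each state pair, normalised per frame.
    loglik -= loglik.max(axis=(0, 1), keepdims=True)
    post = np.exp(loglik)
    post /= post.sum(axis=(0, 1), keepdims=True)      # (Kv, Km, T)

    # Per-pair Wiener gains for the voice, averaged with the posteriors.
    gain_pair = Sv[:, None, :] / S_mix                # (Kv, Km, F)
    gain_voice = np.einsum("ijt,ijf->ft", post, gain_pair)

    voice = gain_voice * X
    music = X - voice                                  # complementary estimate
    return voice, music

In the setting studied in the paper, the music GMM passed to such a routine would first be re-estimated (adapted) on the non-vocal segments of the song before the filtering step is applied.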