This paper studies the effectiveness of a feature extraction model based on Mel-Frequency Cepstral Coefficients (MFCC) and the Fast Fourier Transform (FFT). Using a CNN model, five basic emotions are extracted from the input speech corpus, and spectrograms derived from long utterances are applied to obtain high-precision, fixed-length learned vectors from the audio files. Finally, the authors propose a method for recognizing five emotional states in the RAVDESS and SAVEE emotional speech corpora using FFT-based features. Compared with state-of-the-art related methods, detection accuracy improves by 70% when the proposed model is used to extract audio segments from the audio files and convert the utterances into spectrograms.
Keywords: Mel-Frequency Cepstral Coefficients (MFCC), FFT-based features, CNN model, hybrid HMM/CNN system
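The FFT-based spectrogram step described above can be sketched in standard-library Python. This is an illustrative toy, not the paper's implementation: the function names, frame length, and hop size are assumptions, and a real pipeline would use an FFT library and then map the spectra to the mel scale before computing MFCCs.

```python
import cmath
import math

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames, zero-padding the tail."""
    frames = []
    for start in range(0, max(1, len(signal) - frame_len + 1), hop):
        frame = list(signal[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))  # pad short final frame
        frames.append(frame)
    return frames

def magnitude_spectrum(frame):
    """Magnitude of the DFT of one Hamming-windowed frame (first half only)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):  # real input: keep non-negative frequencies
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spec.append(abs(acc))
    return spec

# Toy signal: a 100 Hz sine sampled at 800 Hz
sr = 800
signal = [math.sin(2 * math.pi * 100 * t / sr) for t in range(256)]
frames = frame_signal(signal, frame_len=64, hop=32)
spectrogram = [magnitude_spectrum(f) for f in frames]  # one row per frame
```

Stacking one magnitude spectrum per frame yields the fixed-width spectrogram rows that a CNN can consume as a 2-D input; for a 100 Hz tone at an 800 Hz sample rate with 64-point frames, the energy peaks at frequency bin 8 (100 × 64 / 800).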