Effect of Data Augmentation, Cross-Validation Methods in Robustness of Explainable Speech Based Emotion Recognition

ABSTRACT


INTRODUCTION
Speech is the most natural and convenient medium to express.This is the main reason why now a day's speech assistants such as Alexa, Siri, OkGoogle have emerged to interact with machine.To make interaction, more natural exhaustive research is carried in the domain of emotion detection based on speech.SER is very challenging task as design or selection of each and every stage of its workflow right from identification of emotion discriminant features to selection of appropriate classifier is crucial [1].SER has applications in digital assistants [2], psychological consultation, assistant for suggesting music therapy [3], health care, online learning [4], for prioritizing service based on customer reviews etc.Using machine learning approaches such as supervised [5,6] unsupervised [7] or deep leaning based SER models [8] are built and validated by researchers.Design of SER model should be robust in terms of noisy natural environment, speaker gender, speaking rate [9].For any artificial intelligence model, also there is need to interpret the detected emotion state based on input features.In recent study, with technological advancements in speech based emotion recognition research gap identified is as below: • Need of identifying unique feature set for speech based emotion recognition.
• Exploration on effect of different data augmentation methods and cross-validation approaches in view of robustness of SER.• Lack of explainable robust SER model.Authors [10,11] introduced Local Interpretable Modelagnostic Explanations (LIME) and Shapley Additive Explanations (SHAP).The objective of LIME and SHAP is to explain model predictions faithfully with respect to local algorithm.In proposed work, machine learning based new robust and explainable SER model using data perturbations and various cross validation methods is introduced to address the research gap.Major contributions of the proposed work are: • Novel explainable artificial intelligence (XAI) based robust speech based emotion recognition model.• Novel feature fusion () is suggested to improve the SER accuracy.• Impact of data augmentations methods and different cross validation methods in SD and SI approach in robust, explainable SER.• Explanation regarding contributions of proposed reduced feature vector () in improving the accuracy as compared to state of art work on open source German language EMoDB database.
Proposed work is explained with introduction in first section.Related work with literature study is discussed in second section.Third section elaborates database, methodology of each stage in detail.Experimentation with results details is well explained in section four.Inferences derived are discussed in section five.Finally, conclusion with findings and future scope are put forth.

RELATED WORK
One of the challenging tasks in the design stages of SER is identifying emotion-discriminant features through feature engineering [9,12] as emotions vary according to age, gender and cultural background.One of the most dominant acoustic features for the detection of speech-based emotion is Mel Frequency Cepstral Coefficients (MFCC).Modifications in traditional MFCC feature extraction, the use of coefficient derived from magnitude spectrum without commonly used discrete cosine transform have been suggested by Ancilin et al. [13].Many researchers [5,[14][15][16], used MFCCs derived from sub-bands decomposed using multi-resolution based wavelet transform to capture emotion-specific acoustics features in both time and frequency domain.Deb et al. [5] extracted acoustic features from sinusoidal peak identified by threshold based on the median pitch frequency of sub-band signals derived from the three-level decomposition of the preprocessed input speech signal with pre-emphasis constant  = 0.94.These features have shown average performance improvement as compared with commonly used feature combinations.Palo and Mohanty [14] used 16 MFCC, 16 linear predictions cepstral coefficient (LPCC) features along with wavelet features derived from three-level decomposition with mother wavelet dB4 as baseline and feature selection with vector quantization method for emotion recognition.Shinde and Patil [15] presented improvement in the misclassification of happy and angry emotions with the feature fusion of spectral and selected emotion discriminative discrete wavelet transform (dB4) features.Kuchibhotla et al. [6] used Sequential Forward Floating Search-based feature fusion with 24 filter bank-based MFCC and LPCC features using Linear discriminant analysis (LDA) and SVM classifier.Nagarajan et al. [17] proposed gamma tone frequency cepstral coefficients (GFCC), human factor cepstral coefficients (HFCC) and 29 filter bank-based MFCC features in combination for emotion detection.Very few researchers explored speaker-dependent and speaker-independent approaches to design machine learning based robust SER systems.The limitation observed is majority of experiments focused only on improving genderwise average emotion recognition rate.For SER applications such as supportive diagnosis tool for psychologists or therapist, there is lack of an explanation of predictions made by a blackbox machine learning model.

METHODOLOGY
Details of the database used, feature extraction, feature selection methods and classifiers employed are explained in detail in this section.Figure 1 depicts the block diagram of the proposed methodology.

Database used
The proposed work uses the popular open-source German language speech emotion dataset EMoDB [18].It is a genderbalanced dataset with audio recordings of 5 male and 5 female speakers.Details of the database are listed in Table 1.Emotion-wise distribution of EMoDB utterances in percentage is represented with Figure 2. It is clear from Figure 2, EMoDB is a highly imbalanced dataset with seven emotion classes: angry, boredom, happy, neutral, sad, disgust, and fear.Different sentences are uttered by professional speakers in soundproof environments.Each emotion class has a different number of sentences by each of the speakers.

Preprocessing
In speech signal processing, pre-processing stages comprise of trimming, normalization, pre-emphasis followed by framing windowing [12].
Pre-Emphasis: Generally, the amplitude of high-frequency signal is affected in recording of database.With the help of first order high pass filter H(n), this effect is eliminated by boosting signal to noise ratio of input sample () as shown in Eq. ( 1) with  = 0.97.
Raw input Speech () is non stationary signal.This is converted into stationary input by dividing into frames of 25ms with 10ms overlap.These frames are subjected to windowing using hamming window function as represented in Eq. ( 2) to save information at the corner of frames.
Trimming: All raw input speech files are sampled at the rate of 44.1Khz.To focus only on information content, leading as well as trailing silence in speech recording is removed in trimming.
Normalization: All audio files are normalized with respect to max.amplitude of input speech signal.

Data augmentation and balancing
Availability of big SER data set is one of the major bottlenecks for SER researchers.In real time applications of SER especially in the domain of healthcare, mental disorders, [4] physiological therapy limited and unbalanced dataset is a common concern.Because of the limited data, overfitting is a common issue, where training accuracy is high while testing accuracy is low.In proposed work as observed from Figure 2, EMoDB is highly unbalanced and limited SER dataset.Hence, data augmentation is necessary.To design robust SER model, there is strong need to handle real time signal conditions such as noise, deviation in pitch etc.This necessity is utilized in audio data augmentation to generate audio signal, while preserving original signal emotional characteristics.This results in Increase of database size to develop generalized machine learning model [3,12,19].In proposed work, three audio data augmentation methods are employed.
• Addition of Gaussian Noise (GN): Additive white Gaussian noise (AWGN) of 15dB to 20 dB is added to input signal.This generates random noise similar to natural noise disturbances.Power spectral density of AWGN is flat and described by Eq. ( 3).
• Time Stretching (TS): Without changing pitch, speed of audio can be increased or decreased using time stretching.• Pitch Shifting (PS): Pitch shifting by 3-quarter tone is used in proposed work.This resembles original sound with echo.used for increasing the data from minority classes.In proposed work, we used data augmentation and SMOTE as data balancing method for German language database EMoDB.After converting into a balanced dataset, the extraction of the most discriminant features is essential.

Feature extraction
The proposed work creates a baseline feature vector by employing feature fusion.Each sub-set of baseline feature vector , has spectral features, time-domain features, and energy-based features, as shown in Eq. ( 4). = {, , 2, , ℎ, , , } (4)

Spectral features
After pre-processing, a baseline fusion feature vector of size 201 is formed.It consists of a fusion of time domain and frequency domain features.Frequency domain features are the mean of frame-level spectral features such as Mel Frequency Cepstral Coefficients (MFCCs), delta MFCCs, double delta MFCCs, Chroma features, and Mel-spectrogram coefficients.Time domain features include the mean of zero crossing rate and root mean square energy.All features are extracted using the Librosa library [21].
• Mel Cepstral Coefficients (MFCC) Features -Mel scale mimics the human perception.Therefore, to compute the MFFCs, the spectrum generated by taking the Fourier transform of the windowed signal is mapped onto the Mel Scale using triangular/cosine overlapping windows as represented in Eq. ( 5).Further discrete cosine transform is derived from the Mel log powers to obtain the Mel Cepstral Coefficients, and the first and second-order derivatives of MFCCs are referred to as delta and double delta coefficients.
where,  is the logarithmic representation of frequency f.
• DWT based features-discrete wavelet transform decomposes audio signal into approximate coefficients and detail coefficients according to the level of decomposition.The DWT-based feature vector is represented in Eq. ( 6).
It provides both time and frequency domain localization of signal.In proposed work, Db4 Wavelet Transform is used.discrete wavelet transform perform signal analysis with multi sub band resolution.It is represented by Eq. (7).
For every emotion, there is considerable variation in RMSE.

Feature selection
To overcome the curse of dimensionality, a reduced feature set is obtained by employing unsupervised principle component analysis (PCA) [3], Various filter-based methods, Analysis of Variance (ANOVA) [15,23] wrapper method and vector quantization (VQ) [14] are employed by researchers for dimensionality reduction.Bandela and Kumar [7] have introduced feature selection based on adaptive structure learning and unsupervised Ordinal Locality This feature subset reduced training time along with better emotion recognition accuracy in noisy environments.Sheikhan et al. [23] have used ANOVA F1-score-based feature selection to identify distinct features.Sheikhan et al. [23] identified F-score to obtain the emotion discriminating features for different combinations of classes.In the proposed work, two-dimensionality reduction approaches, ANOVA and PCA, separate the most precise and emotion-specific features

Analysis of Variance (ANOVA):
The feature selection approach aims to obtain a subset of relevant and reduced features from a bigger feature space to make subsequent analysis efficient.In the proposed work, we used filter-based ANOVA feature selection method for Level 4-decomposed statistical features of DWT Statistical features like Mean, Minimum (minv), Maximum (maxv), SD for each sub-band coefficient of the Four Level DWT, namely cA4, cD4, cD3, cD2, cD1 i.e. in total 20 features for the audio was extracted as described in section 3.4.Figure 5 depicts the box plot of sub-band features cA4_sd and cD4_sd for all emotion classes.ANOVA statistical test is used to select the most significant of these statistical features for all the coefficients.F-ratio and pvalue suggestively identify the discriminating features based on F-ratio ranking; only six features having the top six values are selected in proposed work, as depicted in Table 2.It represents the principle components Based on eigenvalue and eigenvector [3].Direction of maximum variance is indicated with selected principal components.Univariate and multivariate analysis is performed in explorative data analysis of extracted statistical 12 mean_chroma features.Multivariate analysis of pitch class based 12 Chroma _mean features shows a high correlation amongst all seven emotion classes.Therefore, PCA is applied to the chroma feature set.By applying PCA on 12 mean_chroma features, it is observed that only three principle components are sufficient to represent 90% of the variance.

EXPERIMENTS AND RESULTS
SVM is majorly used by researchers [5,13,14,[24][25][26] since various kernel such as radial basis function, linear, sigmoid, polynomial contribute in separating data into high dimensional space for classification.Multi-layer Perceptron (MLP) is a basic neural network suitable for SER supervised classification task [27,28].The number of hidden layers in between input and output layer decides depth of MLP.Authors [25,29] used bagging and boosting ensemble learning approach for stable and improved SER performance.Extreme Gradient Boosting (XGBoost) is popular due to its parallel structure, efficiency, inbuilt capabilities of regularization to avoid overfitting and handling missing values [29,30].XGBoost aggerates individual models iteratively into ensemble by reducing root mean square error at each step.This results in efficiency in speed and performance.Therefore, in proposed work three different classifiers SVM, MLP and XGBoost classifiers are used for testing the performance of proposed reduced feature vector of SER.All these classifiers can also be used in limited resources unlike deep learning classifiers.To ensure robustness of proposed SER model and performance validation both SD and SI classification experiments are carried out [7].To ensure performance accuracy, reliability exhaustive experimentation of SD experiments with holdout Cross Validation (CV) method and stratified k-fold cross validation with k=5 and k=10 are explored.In machine learning based SER, SI approach is less explored.In our work, for SI experiments leave-one-speaker-out (LOSO) and leave-two-speakers-out (LTSO) approaches are explored.Details of experiment with basic dataset and augmented dataset along with SMOTE for data balancing are explained in detail in further subsections.

SD experiments
In these experiments, distribution of 80:20 for training and validation data is employed.After dimensionality reduction, reduced feature vector is subjected to standard scalar and balancing with SMOTE and further fed to classifier for training.This is done for basic dataset and augmented dataset, to analyze effect of data augmentation.Performance measures of accuracy, precision, recall and F1-score majorly used in multiclass classification i.e. speech based emotion recognition are used [1].

SD Experiment + SMOTE + Holdout cross validation with basic EMoDB dataset
For basic EMoDB database, emotion classes of 7, average performance accuracy with SVM classifier is 62.62%, 65.42% and 69.16% for baseline (F), Baseline+ANOVA selected DWT (f+fa) and proposed reduced feature vector f(r) respectively.In SVM the hyper parameters 'C' of RBF kernel is set to 10. Figure 6 represents confusion matrix for MLP classifier with proposed reduced feature vector with SMOTE.From Figure 6, it is evident that angry and sad emotion classes have better recall for MLP classifier.As observed in Table 3, average accuracy is improved for reduced feature vector experiments as compared to baseline and baseline with ANOVA selected wavelet features.Without data augmentation, average performance accuracy of 69.16%, 68.22% and 63.55% are obtained for SVM, MLP and XGBoost classifiers respectively.[3,19] database size increased by four times.Proposed Feature vector expressed with Eq. ( 4) is extracted and after feature selection, reduced feature vector () is used for training and validation.With data augmentation methods, significant improvement in average performance accuracy is achieved listed in Table 3.
From Table 3, it is observed that the average recognition accuracy obtained from SVM classifier with radial basis function (RBF) kernel and hyperparameter C=10 is improved significantly in case of classification using augmented dataset as compared to that using basic dataset.From Figure 6 and Figure 7, it is observed that results of classifier are better for all emotion classes.With the use of augmented EMoDB dataset and SMOTE for data balancing, ensemble Gradient Boosting i.e.XGBoost classifier has also shown significant improvement from 62.62% to 86.68% in overall performance accuracy.From Figure 7, it is clear that for augmented EMoDB database, angry emotion class has highest detection accuracy in the range of 99% for MLP classifier.

SD Experiments with k-fold cross validation with basic and augmented EMoDB
To overcome overfitting and demonstrate robustness of model performance validation is carried out on augmented audio data with stratified 5-fold and 10-fold cross validation.Mean and standard deviation scores for stratified 5-fold and 10-fold are listed in Table 4.

SI experiments
SI approach is essential for robust SER system.In proposed work, to assure robustness, two approaches of cross validation are employed.

LOSO approach
In this approach, out of 10 speakers of EMoDB, each time only one speaker forms the test set and rest 9 speakers form the corresponding training set.This is repeated 10 times and average accuracy is obtained for these 10 fold for basic and augmented EmoDB dataset.Figure 9 shows comparison of fold-wise accuracy for each speaker in augmented EMoDB with SVM (RBF), MLP and XGBoost Classifier.It is observed that, for augmented dataset average accuracy of Speaker 3,8,9,10,12,13,14,15 has accuracy equal to or above 60% whereas speaker 11 and 16 has less accuracy.Speaker 16 has lowest avg.performance accuracy as compared to rest all speakers.

LTSO approach
Another approach used for speaker independent experiment is leave two speaker out (LTSO).In proposed work, in every fold two speaker's combination of one male and one female such as (3,8), (9,10), (11,13), (12,14) and (15,16) are assigned as test set whereas remaining all 8 speakers except speakers in test set are assigned in training set.Such 5-folds of LTSO approach are evaluated on basic and augmented dataset.Figure 10 shows fold wise average performance accuracy of LTSO.It is observed that, with fold IV (12,14) test speakers highest average accuracy of 68.27%, 67.50%, 68.50% is obtained with SVM (RBF), MLP and XGBoost classifiers respectively.Literatures [11,31,32] show SHAP, LIME library are used for interpretation of black box model results.Pham et al. [31] interpreted contribution of sliding window concept based features.Yildirim et al. [32] used SHAP for identification of most contributory features with respect to gender and emotion.In proposed work, trained model of XGBoost classifier is analyzed in detail for explainability.After fitting of model, feature importance is observed for proposed reduced feature set.From experimentation results of XGBoost ensemble learning classifier [29] analysis of all test set, SHAP values are studied.Figure 11 shows the first 10 contributing features in a sorted order from the proposed reduced feature set contributing in final prediction of respective emotion class.It shows the summary of SHAP value plot for XGBoost classifier with average performance accuracy of 86.68% for seven different emotion class.From Figure 11, it is clear that PCA1 component from reduced feature vector is the major contributor for prediction of emotion class 6 i.e. 'sad'.Also cD4_sd DWT feature is amongst top -5 contributing features in detection of emotion class 0. i.e. 'angry.where m_i, d_m_i d2_m_i represents value of i th feature of mean out of 13 MFCC, 13 delta-MFCC and 13 double delta MFCC features respectively.LmelS_i represents mean value of i th feature out of 128 log-mel Spectrogram features used in feature vector.PCA1, PCA2 and PCA3 represent the principle components of Chroma features in reduced feature set.

DISCUSSION
Table 3 shows that for speaker-dependent experiments with holdout cross-validation, significant improvement is observed for reduced feature vector compared with only baseline and feature combination of baseline with ANOVA selected discrete wavelet features for both basic and augmented EMoDB.After augmentation, SVM (RBF) average accuracy performance got improved from 69.16% to 93.69%, for XGBoost classifier accuracy improved from 63.55% to 86.68% and for MLP classifier accuracy improved from 68.22% to 94.63% for proposed reduced feature vector.From Table 4, it is clear that speaker-dependent results of stratified 10-fold are robust as performance of SVM (RBF), XGBoost, and MLP classifier are almost the same for 5-fold and improved as compared with holdout performance.This signifies the robustness of the proposed SER.Comparing Figure 6 and Figure 7, confusion matrix of MLP classifier with basic EMoDB and augmented EMoDB depicts that deviation of each individual emotion class from average performance is improved significantly after augmentation.The effect of data augmentation is well signified by the graphical presentation of performance measures precision and recall with Figures 8 (a) and (b) respectively.For SVM(RBF) classifier, the highest average accuracy of 69.16% obtained with () with basic EMoDB has an average recall (UAR) of 68.11%.
Speaker-independent experimentation results listed in Table 5 show that a maximum average performance accuracy of 60.94% is obtained for LOSO and 58.59% for the LTSO approach, respectively.From Figures 11 and 12, the contribution of each feature from the reduced feature vector for each emotion class in the test dataset is well explained.From Figures 12 (a) and (b) it is evident that the proposed reduced feature set formation is justified by observing the class-wise feature value of the first ten contributing features.As observed in Figure 12 (a) for prediction of angry emotion class, logMelS_1 to 3, cD4_sd, logMelS_9, logMelS_11, 6 th and 11 th MFCC along with MFCC double derivative 4 i.e. d2_m_4 are majorly contributing features.With the use of SHAP [33], the proposed reduced feature is well explained for distinct emotion classes of German language-based speech emotion classification and prediction of emotion classes with interpretation.The proposed robust and explainable SER system is listed in Table 6, with state-of-the-art work carried out by the researcher on the German language open source database EMoDB.Table 6 depicts performance comparison with state of art for speaker dependent approach.In speaker independent approach, maximum average weighted accuracy of 64.52% with leave-two-speaker-out (LTSO) approach is obtained.

CONCLUSION
Proposed work with novel feature fusion improved the SER accuracy using German language database EMoDB.Use of supervised feature selection method ANOVA, for distinct discrete wavelet transform level-4 using dB4 (DWT) based features as well as derived principle components for chroma features reduced feature vector () from 201 to 178.This enhances the performance of SER significantly.Performance of novel reduced feature vector () is validated using exhaustive SD experiments with support vector machine (RBF), MLP and ensemble XGBoost classifier with holdout and k-fold with k=5 and k=10 cross validation experiments.
The proposed novel reduced feature vector () shows significant improvement in the performance on employing data augmentation and SMOTE data balancing technique.In case of speaker dependent experiments using augmented EMoDB, the average performance accuracy improved from 68.22% to 94.63% for MLP classifier while the average accuracy increased from 69.16% to 93.69% and 63.55% to 86.68% with SVM(RBF) and XGBoost classifiers respectively.Robustness of the SER is also demonstrated through the k-fold cross validation experiments with k=5 and K=10 where considerable improvement in average accuracy is observed with augmented EMoDB as compared to that with basic EMoDB dataset.For stratified 10-fold cross validation using augmented EMoDB dataset, average performance accuracy is seen to improve from 73.07% to 93.18% with SVM (RBF) classifier while it is boosted from 74.14% to 94.07% and 71.02% to 87.15% with MLP and XGBoost classifier respectively.
In case of speaker dependent approach, proposed explainable SER performs better as compared to state of art experiments.Proposed work can be used in many applications of human machine interface.For speaker independent experiments maximum performance accuracy of 60.94%, 58.59% of is obtained respectively in case of LOSO and LTSO methods.The current work can be further extended by employing adaptive techniques in feature selection, classification for improving the performance of speaker independent approach.

Figure 3 Figure 3 .
Figure 3. Original and augmented audio signals Shinde et al. [3] employed these three audio augmentation methods along with additional pitch change method in SER.This results into generalized SER model for suggestion of music therapy.Basic EMoDB has 535 audio utterances, and data augmentation results in augmented database size of 2140 audio utterances.Praseetha et al. [19] used data augmentation methods in time domain by stretching of audio signal duration and resampling of input signal.Along with data augmentation, all emotion class wise audio utterance percentages are made same with oversampling the lower distribution classes.Liu et al. [20] used variations of data balancing technique to nullify the effect of biasing by major class.In proposed work, Synthetic Minority Oversampling Technique (SMOTE) is

(
SD) wavelet features are extracted of five sub-bands as represented in Eq. (6).Total 5 subbands of 4 statistical features i.e. 20 discrete wavelet coefficients are extracted per utterance.• Chroma Features-Human perception is based on the pitch of the audio signal.Pitch is described using two components: tone height and chroma.Chroma features represent audio signals in 12 octave form.In the proposed work, 12 mean_chroma features per utterance are extracted.• Log-Mel Spectrogram based features-In proposed work, 128-Log-Mel Spectrogram coefficients per utterance are extracted with maximum frequency of 8KHz.3.4.2Time domain feature-zero crossing rate (ZCR) Zero crossing rate is mainly used to distinguish the signal and silence portion of input speech.ZCR_mean is extracted from the audio recordings of the EMoDB dataset.
Figure 4   shows variation in ZCR_mean for all seven emotion classes of EMoDB.

Figure 5 .
Figure 5. Box plot of cA4_sd and cD4_sd featuresTable 2. Discriminative sub band features based on ANOVA Rank Based on F-Ratio

Figure 6 .
Figure 6.Confusion matrix of MLP classifier with 80:20 holdout CV for basic EMoDB 4.1.2SD Experiment + SMOTE +Holdout cross validation with Augmented EMoDB dataset After employing three different data augmentation methods of GN, PS and TS[3,19] database size increased by four times.Proposed Feature vector expressed with Eq. (4) is extracted and after feature selection, reduced feature vector () is used for training and validation.With data augmentation methods, significant improvement in average performance accuracy is achieved listed in Table3.From Table3, it is observed that the average recognition accuracy obtained from SVM classifier with radial basis function (RBF) kernel and hyperparameter C=10 is improved significantly in case of classification using augmented dataset as compared to that using basic dataset.From Figure6and Figure7, it is observed that results of classifier are better for all emotion classes.With the use of augmented EMoDB dataset and SMOTE for data balancing, ensemble Gradient Boosting i.e.XGBoost classifier has also shown significant improvement from 62.62% to 86.68% in overall performance accuracy.From Figure7, it is clear that for augmented

Figure 7 .
Figure 7. Confusion matrix of MLP classifier with 80:20 holdout CV for Augmented EMoDB Figures 8 (a) and (b) depicts effect of data augmentation on the performance measures average accuracy, precision and recall in case of speaker dependent experiment for MLP classifier with basic and augmented EMoDB for holdout cross validation.

Figure 8 .
Figure 8.Effect of data augmentation on performance measures

Figure 11 .Figure 12 .
Figure 11.SHAP summary plot of test data for XGBoost Classifier with proposed reduced feature vector ()

Table 4 .
Performance of k-fold cross validation for SD experiments with proposed reduced feature set()

Table 5
represents summary of both Leave One Speaker Out (LOSO) and Leave Two Speaker Out (LTSO) methods adopted for speaker independent experiments for EMoDB dataset.

Table 5 .
SI experiments summary with proposed reduced feature set()

Table 6 .
State of art comparison of speaker dependent SER experiment on EMoDB