Indian Cross Corpus Speech Emotion Recognition Using Multiple Spectral-Temporal-Voice Quality Acoustic Features and Deep Convolution Neural Network

ABSTRACT


INTRODUCTION
Affective computing seeks to facilitate people's natural interaction with computers. One of its main goals is to enable computers to understand people's emotional states so that customized responses can be provided [1,2]. Recent years have seen growing interest in speech emotion recognition (SER), which is often performed on the premise that the utterances in the training and testing datasets are recorded under the same conditions. However, as voice data are often gathered from many devices or places, this assumption does not hold in practice. Due to the disparity between the training and testing datasets, SER suffers from the class imbalance problem [3,4].
Emotions reflect the psychological state of a human being. Various physiological and psychological signals, such as speech, facial expressions, electrocardiograms (ECG), and electroencephalograms (EEG), are used to capture emotional expression. Speech is the most natural and easiest mode of interaction, and it carries rich emotional content and context. SER is therefore the most straightforward route to human-machine interaction (HMI). Generalized SER systems use the same corpus for training and testing, which may lead to poor results on a new corpus [5][6][7]. SER is very challenging due to many factors such as age, health status, gender, linguistic variability, cultural variability, recording environments, and languages with distinct corpora. Speech attributes vary strongly across corpora, which leads to a poor recognition rate for SER systems designed for a single corpus. Nowadays, various cross-corpus SER systems have been implemented that use one dataset for training and another for testing [8,9].
In past decades, most SER techniques used the same corpus for training and testing, and researchers achieved noteworthy success under controlled experimental conditions [10][11][12]. Earlier SER systems used traditional machine learning (ML) techniques such as the Gaussian Mixture Model (GMM) [13], Hidden Markov Model (HMM) [14], Support Vector Machine (SVM) [15], K-Nearest Neighbor (KNN) [16], Random Forest classifier (RF) [17], and Artificial Neural Network (ANN) [18], along with handcrafted feature extraction and pre-processing schemes. In recent years, deep learning (DL) has attracted extensive attention for SER because of its robustness, high feature representation capability, ability to work with larger datasets, and higher recognition rate. Various deep learning techniques have been applied successfully to SER, such as the Auto-encoder (AE) [19], Convolution Neural Network (CNN) [20], Deep Belief Network (DBN) [21], Long Short-Term Memory (LSTM) [22], and Recurrent Neural Network (RNN) [23]. Although traditional ML- and DL-based SER has made remarkable progress under controlled experimental environments, generalization remains the key challenge that must be solved before SER systems can be adopted in real-time applications. Thus, there is a need to increase the generalization capability of cross-corpus SER to improve recognition across different corpora [24,25]. Most SER techniques presented in the past were trained on English and other European languages, which limits their performance on other languages due to cultural, regional, and linguistic variations. Very little attention has been given to SER for Indian languages, even though Indian corpora and regions vary widely. Therefore, there is a need for an SER system for Indian languages that can model the linguistic variations in the Indo-Aryan and Dravidian language families. This paper presents cross-corpus SER (CCSER) using multiple acoustic features and a Deep Convolution Neural Network (DCNN) for four Indian languages: Hindi, Urdu, Telugu, and Kannada. The main contributions of the work are summarized as follows:
• Robust feature representation of the speech signal using multiple acoustic features consisting of spectral, temporal, and voice quality features.
• Salient feature selection with higher inter-class and lower intra-class variance using a Fire Hawk based optimization scheme.
• Design of a lightweight one-dimensional DCNN for improving the feature distinctiveness for CCSER.
• Analysis of the proposed CCSER scheme for single-corpus and multi-corpus training and testing for four Indian languages: Hindi, Urdu, Telugu, and Kannada.
The outcomes of the proposed SER scheme are validated using accuracy, precision, recall, and F1-score.
The remainder of the paper is arranged as follows: Section 2 reviews the techniques used for SER and CCSER in recent years. Section 3 gives a detailed description of the dataset, acoustic features, and DCNN model. Section 4 presents the analysis and discussion of the simulation results of the suggested CCSER scheme. Lastly, Section 5 concludes the work and outlines the future scope for improving the proposed CCSER scheme.

RELATED WORK
The representation of highly discriminative features in deep learning (DL) has drawn the attention of investigators in the last decade. Deep learning algorithms can be applied to both feature extraction and classification. Zhang et al. [26] presented a DCNN for SER to bridge the semantic gap between low-level information and subjective emotions. The model used three log MFCC features (static, delta, and delta-delta coefficients) to train AlexNet. Discriminant Temporal Pyramid Matching (DTPM) was used to combine the learnt high-level features, and an SVM was used to classify emotions. Extensive testing on the EMO-DB, RML, eNTERFACE05, and BAUM-1s databases showed encouraging results, and it was found that a DCNN pre-trained for image applications can also be used to extract voice features. Lp-norm pooling provided superior results compared with maximum and average pooling. Neumann [27] introduced an attentive CNN (ACNN) for cross-lingual and multi-lingual SER. Arousal prediction benefits from fine-tuning with fewer parameters, but valence prediction is sensitive to cross-language training. Multilingual training was shown to often yield better performance than monolingual and cross-lingual training, and fine-tuned cross-lingual training may greatly enhance the system's effectiveness. Zhao et al. [28] explored a merged DNN for SER that combines a 1D-CNN applied to the audio clip with a 2D-CNN applied to the spectrogram. It used Bayesian optimization for fine-tuning the integrated features. Transferring a deep learning model from a bigger dataset to a smaller dataset may improve its outcome. On the Berlin EmoDB and IEMOCAP datasets, the combined CNN produced accuracy rates of 89.77% and 86.36% for speaker-dependent and speaker-independent SER systems, respectively. Ocquaye et al. [29] investigated dual exclusive attentive transfer (DEAT) to adapt the source and target domains for an unsupervised CNN. It employs correlation alignment loss (CALLoss) to reduce the domain inconsistency in the second-order statistics of the source and target attention maps. Raw spectrogram features are fed to a five-layer CNN to learn discriminative and salient features. Although the model is easy to tune, the larger feature vector led to higher computational complexity.
Lotfian and Busso [30] used a DNN with curriculum learning for SER on ambiguous emotional speech. In curriculum learning, simple, easily recognizable examples are taught first, followed by more complicated samples. The fundamental frequency and MFCC were employed for feature extraction. The approach improved significantly over previous baseline approaches. Tripathi et al. [31] presented CNN-based SER using the speech signal and its transcript. Text and speech MFCC features are processed by CNNs and combined in a fully connected layer for classification. It showed an improvement of about 7% over existing benchmark methodologies. Zhao et al. [32] deployed one 1-D CNN-LSTM and two 2-D CNN-LSTM networks to learn the long-term dependencies from emotion features extracted using MFCC. The features retrieved using the CNN-LSTM include both long-term dependencies and local information.
Peng et al. [33] suggested 3D convolution and an attention-based sliding RNN (ASRNN) to learn both the dynamics of emotion and cognitive continuity. The periodic information and regional characteristics of the voice stream are captured using 3D convolution, while the ASRNN aids the local-level feature representation of the speech signal. The results from the attention model were superior to those from maximum and mean pooling, and segment-based attention models performed better than frame-based attention models. Due to a data imbalance issue, the MSP-IMPROV dataset produced unsatisfactory results (accuracy of 55.70%). Traditional approaches do not generalize well and miss latent information in databases. The generalized domain adversarial neural network (GDANN) provides a domain-invariant and generalized representation of speech data, and the class-aligned GDANN (CGDANN) reduces the class alignment issues caused by a small number of labelled targets [34]. Ai et al. [35] proposed an attention model integrated convolution RNN (ACRNN) with redagging and augagging mechanisms to handle imbalance in SER. Redagging addresses the issue of inspection recurrence, while augagging addresses the issue of a missing complete image.
Xia et al. [36] suggested a DNN-based SER that captures the temporal segment-level aspects of low-level features of voice signals. It used low-level descriptors of the emotion signal related to energy, spectral, statistical, and voice properties. Attentive temporal pooling outperformed typical pooling. Chen et al. [37] explored first-order attention networks to address the issues of data imbalance and utterance variability. A pre-trained CNN (VGG) network was used to optimize the segment-level features collected from the log Mel spectrogram.
Furthermore, discriminative segment-level features are learned using a bidirectional LSTM (Bi-LSTM), which reduced the problems of utterance variability and imbalanced data. The collaborative structure of labeled and unlabeled data and categorization has also been learned using smooth semi-supervised generative adversarial networks (SSSGAN). The dependence on labelled data was decreased through the virtual smoothed SSSGAN (VSSSGAN). It is resilient to data alterations and can handle domain mismatch issues, but a bigger dataset was needed to smooth the model in an adversarial direction [38]. Falahzadeh et al. [39] explored a 3-D representation of the speech emotion signal, known as a "chaogram", that can characterize the meaningful information of the speech emotion signal.
Further, a VGG-based DCNN is used to learn the high-level attributes of the chaogram. The Grey Wolf Optimization (GWO) algorithm is used to optimize the hyper-parameters of the DCNN architecture. The approach provides promising results on the EMO-DB and eNTERFACE-05 datasets. Prakash et al. [40] presented a Gated Recurrent Unit and CNN (CNN-GRU) to investigate robust, discriminative, and emotionally salient features for SER. Aggarwal et al. [41] investigated a two-way feature representation of the speech signal for SER based on principal component analysis (PCA) and the Mel spectrogram. The first phase includes spectral feature extraction (MFCC, centroid, roll-off), feature normalization using MinMaxScaler, feature reduction using PCA, and a DNN for feature representation. In the second phase, Mel spectrograms are provided to VGG16 for feature learning. Their SER scheme outperforms traditional techniques and provides 81.94% and 97.15% accuracy for eight classes of SER on the RAVDESS and TESS datasets, respectively. However, it needs a large number of trainable parameters (782K for the DNN and 138M for VGG16), which adds computational burden and makes it less suitable for standalone devices with lower computational ability. Cross-corpus SER encounters the problem that speech signals are highly diverse in background noise, echo, recording equipment, language, speaker, and reverberation, which results in corpus bias since the training and testing data are gathered from distinct datasets [42]. The comparative analysis of the various SER techniques is provided in Table 1.
Deep learning-based approaches have advanced multiclass speech emotion recognition. They can provide a strong connection and representation of the unprocessed emotional input. Compared to conventional ML-based approaches, DL-based techniques have been demonstrated to be much more effective. However, deep learning methods have several drawbacks, including architectural complexity, the class-imbalance issue, longer training times, and difficult hyper-parameter adjustment [43][44][45][46]. Very few researchers have worked on Indian languages for SER. The Indian languages have multiple families and dialects that affect the intonation, timbre, and prosodic changes in speech. The current systems for Indian-language SER are less generalized, as an SER system designed for one corpus is unsuitable for another [47][48][49][50]. The effectiveness of an SER system is highly affected by the quality of the features; therefore, the proposed work uses multiple acoustic features (MAF) to describe the significant characteristics of the speech and Fire Hawk Optimization (FHO) based feature selection to enhance cross-corpus SER. Indian languages have two prominent families: Indo-Aryan and Dravidian. This work selects two languages, Hindi and Urdu, from the Indo-Aryan family, and two, Kannada and Telugu, from the Dravidian family.

Dataset
The proposed SER scheme is evaluated on four languages from two Indian language families. It uses Hindi and Urdu corpora from the Indo-Aryan language family and Telugu and Kannada corpora from the Dravidian language family. Four common emotions (angry, happy, neutral, and sad) are selected for the CCSER. Of the total data, 70% is used for training and 30% for testing. The summary of the datasets used is given in Table 2.

Methodology
The framework of the proposed SER scheme is given in Figure 1. It encompasses preprocessing, multiple acoustic feature extraction, and a DCNN for CCSER. The multiple acoustic features capture the temporal variations in the emotion signal using time-domain features, the spectral properties of the signal using various spectral features, and the variation in amplitude and frequency of the voice signal using voice quality features. The DCNN improves the connectivity and correlation of the different local and global acoustic features for effective CCSER.

Multiple acoustic features
A wide range of acoustic features is extracted from the signal to characterize the changes that occur in the voice due to emotion. The proposed feature set consists of time-domain features, spectral features, and voice quality features, as given in Table 3. The feature vector is passed to FHO for prominent feature selection, and the selected salient features are then given to the DCNN architecture to improve the feature representation.

A. Multi-taper MFCC
Generalized MFCC uses only one Hamming window, which has a higher variance and is unable to capture the variations over the frame of the speech signal. During the pre-emphasis stage, the speech is filtered using a moving average filter to reduce the noise it contains. As part of the framing process, the entire signal is divided into frames of 40 ms each. This is necessary for multi-taper windowing to gather the adjoining spectral components together. The DFT is used to transform the signal from the time domain to the spectral domain. The linearly scaled spectrum is then converted to the Mel frequency scale, which approximates human auditory perception. After the log filter-bank energies have been calculated over the frames, the Discrete Cosine Transform (DCT) is used to transform them to the cepstral domain and reduce the signal's redundancy, and 13 cepstral values are chosen as the features. Figure 2 depicts the flow of the MT-MFCC process. In contrast to generalized MFCC, the windowing of the speech signal in MT-MFCC uses multiple tapers with diverse shapes, which helps to increase the frequency resolution. The sinusoidal weighted cepstrum estimator (SWCE) provides a lower error than traditional MFCC, as given in Eq. (1) [29,30], where $N$ denotes the total frames, $w_p$ depicts the taper window, and $p = 1, 2, 3, \ldots, M$. The SWCE weights for each taper are estimated using Eq. (2) [31].
After multi-taper windowing, the speech signal is converted into a spectral-domain signal using the fast Fourier transform (FFT). The Mel frequency spectrum is then transformed using the DCT to minimize the redundancy in the signal. In total, 39 MT-MFCC coefficients are used to represent the voice signal: 13 MT-MFCC coefficients, 13 Δ-MT-MFCC coefficients, and 13 ΔΔ-MT-MFCC coefficients. Figure 3 visualizes the stages of MT-MFCC.
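The multi-taper spectral estimate described above can be sketched as follows in Python/NumPy. This is a minimal illustration, not the authors' MATLAB implementation: the sine taper family, the 16 kHz sampling rate, the number of tapers (6), and the uniform taper weights are all assumptions made here; the SWCE weights of Eq. (2) would replace the uniform `weights` vector.

```python
import numpy as np

def sine_tapers(frame_len: int, num_tapers: int) -> np.ndarray:
    """Return orthonormal sine tapers, shape (num_tapers, frame_len)."""
    t = np.arange(1, frame_len + 1)
    p = np.arange(1, num_tapers + 1)[:, None]
    return np.sqrt(2.0 / (frame_len + 1)) * np.sin(np.pi * p * t / (frame_len + 1))

def multitaper_power_spectrum(frame, num_tapers=6, weights=None):
    """Weighted average of the power spectra obtained with each taper."""
    frame = np.asarray(frame, dtype=float)
    tapers = sine_tapers(frame.size, num_tapers)
    if weights is None:
        # Uniform weights are an assumption; SWCE would supply its own weights.
        weights = np.full(num_tapers, 1.0 / num_tapers)
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return weights @ spectra

# Hypothetical example: one 40 ms frame of a 1 kHz tone sampled at 16 kHz.
fs = 16000
frame_len = int(0.040 * fs)
n = np.arange(frame_len)
frame = np.sin(2 * np.pi * 1000 * n / fs)

tapers = sine_tapers(frame_len, 6)
spectrum = multitaper_power_spectrum(frame)
peak_hz = np.argmax(spectrum) * fs / frame_len
```

Averaging over several orthogonal tapers lowers the variance of the spectral estimate compared with a single Hamming window, at the cost of a slightly wider main lobe.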

B. LPCC
The linear prediction cepstral coefficients (LPCC) are spectral features derived from linear predictive analysis and are used to reflect the emotion-specific phonological characteristics of the speech signal. The LPCC describes the properties of the human vocal tract well, which helps to identify the emotional content of speech. In linear predictive analysis, the previous $p$ samples are used to estimate the $n$-th sample, as shown in Eq. (4),
where $a_1, a_2, \ldots, a_p$ are constants over the speech frame. The speech sample is predicted by these linear predictor coefficients. The suggested method uses a total of 13 LPCC coefficients [13] as features.
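As an illustration of how cepstral coefficients follow from the predictor coefficients, the sketch below applies the standard LPC-to-cepstrum recursion $c_n = a_n + \sum_{k=1}^{n-1} (k/n)\, c_k\, a_{n-k}$. It assumes the predictor convention $\hat{x}(n) = \sum_{k=1}^{p} a_k x(n-k)$ and omits the gain term $c_0$; the function name and the example coefficients are hypothetical, and the paper does not state which LPCC variant was used.

```python
import numpy as np

def lpc_to_cepstrum(a, num_ceps=13):
    """Convert predictor coefficients a_1..a_p into cepstral coefficients c_1..c_num_ceps.

    Assumes x_hat(n) = sum_k a_k * x(n - k), i.e. the all-pole model
    H(z) = 1 / (1 - sum_k a_k z^-k). The gain term c_0 is not computed.
    """
    a = np.asarray(a, dtype=float)
    p = a.size
    c = np.zeros(num_ceps)
    for n in range(1, num_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0      # a_n is zero beyond the model order
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# Hypothetical first-order predictor: the cepstrum should equal 0.5**n / n.
lpcc = lpc_to_cepstrum([0.5], num_ceps=13)
```

For a first-order model the recursion can be checked in closed form, since $\ln\bigl(1/(1 - a z^{-1})\bigr) = \sum_n (a^n/n)\, z^{-n}$.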

C. Spectral Kurtosis (SK)
Spectral kurtosis indicates the presence of transients together with their locations in the spectral domain. It describes the non-Gaussianity or flatness of the speech frequency spectrum around its centroid, which reflects the impact of varying levels of arousal and emotional valence on the speech spectrum. The spectral kurtosis of the voice is calculated using Eq. (5).
Here, $\mu_1$ depicts the spectral centroid, $\mu_2$ denotes the spectral spread, $s_k$ is the spectral value at bin $k$, and $b_1$ and $b_2$ are the lowest and highest bounds of the bins over which the SK of the voice is computed.
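A small sketch of a spectral kurtosis computation consistent with the symbols defined above is given below. It assumes the commonly used definition $SK = \sum_{k=b_1}^{b_2} (f_k - \mu_1)^4 s_k \,/\, \bigl(\mu_2^4 \sum_{k=b_1}^{b_2} s_k\bigr)$, where $f_k$ is the frequency of bin $k$; whether Eq. (5) has exactly this form is an assumption, and the function name is hypothetical.

```python
import numpy as np

def spectral_kurtosis(freqs, mags):
    """Kurtosis of a spectrum around its centroid.

    freqs: bin frequencies f_k (already restricted to the band b1..b2)
    mags:  spectral values s_k at those bins
    """
    freqs = np.asarray(freqs, dtype=float)
    mags = np.asarray(mags, dtype=float)
    total = mags.sum()
    centroid = (freqs * mags).sum() / total                              # mu_1
    spread = np.sqrt((((freqs - centroid) ** 2) * mags).sum() / total)   # mu_2
    return (((freqs - centroid) ** 4) * mags).sum() / (spread ** 4 * total)

# Hypothetical spectra over five bins: a flat one and a sharply peaked one.
bins = np.arange(5)
sk_flat = spectral_kurtosis(bins, np.ones(5))
sk_peaky = spectral_kurtosis(bins, [0.01, 0.01, 1.0, 0.01, 0.01])
```

A spectrum concentrated in one narrow peak yields a much larger value than a flat spectrum, which is why the measure responds to transients.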

D. Formants
Formants are high-energy frequency peaks in the spectrum. They are most noticeable in vowels. Every formant corresponds to a vocal tract resonance, which characterizes the influence of emotion on the speech signal. Here, the first three formants, together with the mean and standard deviation of the formants, are used to characterize the spectral changes in the voice signal due to emotion.

E. Pitch Frequency
Pitch ($F_0$) is important for describing the vocal component of communication. The pitch of the speech can be determined by calculating the spacing between the peaks of the speech signal's autocorrelation.
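The autocorrelation approach can be sketched as below. The search range of 60 to 400 Hz, the 16 kHz sampling rate, and the function name are assumptions for illustration; the sketch takes the lag of the strongest autocorrelation peak within that range, which equals the spacing between the zero-lag peak and the first dominant peak.

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 (Hz) from the strongest autocorrelation peak in [fmin, fmax]."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Keep only non-negative lags of the full autocorrelation.
    acf = np.correlate(frame, frame, mode="full")[frame.size - 1:]
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), frame.size - 1)
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / best_lag

# Hypothetical example: a 40 ms frame of a 200 Hz tone sampled at 16 kHz.
fs = 16000
n = np.arange(int(0.040 * fs))
f0 = estimate_pitch(np.sin(2 * np.pi * 200 * n / fs), fs)
```

Real speech frames would normally need a voicing decision before this step, since unvoiced frames have no meaningful autocorrelation peak.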

F. ZCR
The zero-crossing rate (ZCR) measures how often the signal crosses the zero line, which represents the degree of noise in the voice signal. ZCR is computed in the time domain using Eq. (6). Over a time period, the sign function returns '1' for a positive amplitude and '0' for a negative amplitude $x(t)$ of the speech.
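A minimal sketch of the ZCR computation with the sign convention described above (1 for positive samples, 0 otherwise) follows. Normalizing by the number of sample transitions is an assumption, since Eq. (6) is not reproduced here.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign indicator changes."""
    frame = np.asarray(frame, dtype=float)
    signs = (frame > 0).astype(int)          # 1 for positive, 0 for non-positive
    return np.abs(np.diff(signs)).sum() / (frame.size - 1)

zcr_alternating = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])
zcr_constant_sign = zero_crossing_rate([1.0, 2.0, 3.0])
```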

G. Jitter and Shimmer
The fluctuations in the frequency and amplitude of speech induced by aperiodic vocal fold vibrations are known as jitter and shimmer, respectively. Jitter and shimmer capture the breathiness, hoarseness, and roughness of the emotional voice. The mean jitter is given by Eq. (7),
where $T_i$ denotes the duration of the $i$-th period in seconds and $N$ is the total number of periods. Eq. (8) represents the mean shimmer,
where $A_i$ is the peak-to-peak amplitude of the speech signal in the $i$-th period and $N$ is the number of periods. Eq. (9) represents the feature representation provided to FHO for selecting the prominent features.
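The two voice quality measures can be sketched as mean absolute differences between consecutive periods and consecutive peak-to-peak amplitudes. This absolute form is an assumption; relative (percentage) and dB variants of jitter and shimmer are also in common use, and the text does not say which one Eqs. (7) and (8) correspond to.

```python
import numpy as np

def mean_jitter(periods):
    """Mean absolute difference between consecutive pitch periods T_i (seconds)."""
    periods = np.asarray(periods, dtype=float)
    return np.abs(np.diff(periods)).mean()

def mean_shimmer(amplitudes):
    """Mean absolute difference between consecutive peak-to-peak amplitudes A_i."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.abs(np.diff(amplitudes)).mean()

# Hypothetical measurements for three consecutive glottal cycles.
jitter = mean_jitter([0.005, 0.006, 0.005])
shimmer = mean_shimmer([1.0, 0.8, 1.0])
```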

FHO based feature selection
Feature selection serves both to find the significant features and to reduce the length of the feature vector. A smaller set of still-informative features improves the SER system's output. FHO is a metaheuristic algorithm that mimics the foraging behaviour of fire hawks (FHs): starting fires, spreading fires, and catching prey. To avoid becoming trapped in a local optimum, and thereby improve the global optimal solution, FHO takes into account the average of the solution candidates in a certain region. Figure 4 shows the flow chart for selecting features from the set of acoustic features.
Wildfires, caused either by natural phenomena or by humans, are among the biggest risks to animals and ecosystems. Birds referred to as "fire hawks", such as whistling kites, black kites, and brown falcons, may deliberately spread a fire in order to catch prey.
The fire hawk carries burning sticks in its beak and drops them over an area that has not yet caught fire, carefully spreading the flames in order to capture its prey. The fire hawks start small fires to frighten their prey, which includes snakes, rats, and other small animals. This causes the prey to flee in panic, which makes it easier for the FHs to capture it.
The set of candidate solutions ($A$) gives the initial positions of the FHs and their prey. The population of FHs and prey is first initialized at random within the bounds of each parameter, as given in (10) and (11). Values of α, , and  are taken into account by the populace. Here, $N$ stands for the number of solutions, $A_i^j$ signifies the $j$-th decision variable of the $i$-th solution, and $A_{i,\max}^j$ and $A_{i,\min}^j$ stand for the decision variable's upper and lower bounds, respectively. To assess the fitness of the solutions, an objective function based on the intra-class and inter-class variance is used. The solutions with the best fitness are kept as FHs, while the other solutions are treated as prey. The global best solution is treated as the main fire from which the fires are started. To facilitate hunting, the chosen FHs start fires in the unburned region. The prey and the FHs are described in (12) and (13). Here, $PR_k$ is the $k$-th prey in the search space and $FH_l$ is the $l$-th FH in the search space.
In the next phase, the FH catches the nearby prey based on the distance metric given in (14). Here, $D_k^l$ is the distance between the $l$-th FH and the $k$-th prey, and $(x_1, y_1)$ and $(x_2, y_2)$ indicate the coordinates of the FH and the prey.
During the next phase, the FHs pick up burning sticks and start fires in unburned areas of their territory to trap their prey and make its escape hurried and difficult. The FHs use (15) to update their location in order to defend their area and stop other fire hawks from grabbing burning sticks. Here, $r_1$ and $r_2$ are random numbers between 0 and 1 that control the movement of the FHs towards the main fire (global solution).
When the prey sees the flaming sticks on the ground, it begins to flee, hide, or inadvertently run towards the hawks. The prey's location in this case is updated using (16). Here, $r_3$ and $r_4$ are random numbers in [0, 1] that act as coefficients for the movement towards the hawks and towards a safe spot, and $PR_q^{new}$ represents the new position of the $q$-th prey encircled by the $l$-th fire hawk.
The prey may sometimes relocate to another hawk's territory or to a safer location outside the area. The updated position is given by (17). Here, $r_5$ and $r_6$ are random numbers in [0, 1] that act as coefficients for the migration of the prey towards other hawks and towards safe locations beyond the region.
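The three position updates described above can be sketched as follows. Equations (15) to (17) are not reproduced in this text, so the exact form used here (a random pull towards one point minus a random pull towards another) follows a commonly cited statement of the Fire Hawk Optimizer and should be read as an assumption. All function and argument names are hypothetical; `safe_place` stands for a safe spot such as the mean position of the prey in a territory.

```python
import numpy as np

def update_fire_hawk(fh, global_best, fh_near, rng):
    """FH moves relative to the main fire (global best) and a nearby FH."""
    r1, r2 = rng.random(2)
    return fh + (r1 * global_best - r2 * fh_near)

def update_prey_inside(prey, fh, safe_place, rng):
    """Prey moves relative to its own fire hawk and a safe place in the territory."""
    r3, r4 = rng.random(2)
    return prey + (r3 * fh - r4 * safe_place)

def update_prey_outside(prey, fh_other, safe_place_outside, rng):
    """Prey moves relative to another fire hawk and a safe place outside the territory."""
    r5, r6 = rng.random(2)
    return prey + (r5 * fh_other - r6 * safe_place_outside)

# Hypothetical 5-dimensional candidates.
rng = np.random.default_rng(0)
fh = rng.random(5)
prey = rng.random(5)
new_fh = update_fire_hawk(fh, global_best=rng.random(5), fh_near=rng.random(5), rng=rng)
new_prey = update_prey_inside(prey, fh, safe_place=prey.copy(), rng=rng)
unmoved = update_fire_hawk(fh, np.zeros(5), np.zeros(5), rng)
```

For feature selection, the continuous positions would still have to be mapped to a feature subset (for example by thresholding or ranking) and clipped to the bounds of (10) and (11); that mapping is not specified in the text.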
The fitness of a solution is calculated using (20), which is based on the intra-class and inter-class variability of the speech features. Higher inter-class and lower intra-class feature variability helps to select a highly discriminative feature set for SER.
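One simple way to score a candidate feature subset by inter-class and intra-class variability is a Fisher-style ratio, sketched below. Equation (20) is not reproduced here, so this ratio is only an assumed stand-in for the paper's objective; the function name and toy data are hypothetical.

```python
import numpy as np

def variance_ratio_fitness(features, labels, mask):
    """Inter-class scatter divided by intra-class scatter for the selected features.

    features: (num_samples, num_features) array
    labels:   (num_samples,) integer class labels
    mask:     (num_features,) 0/1 vector marking the selected features
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    selected = features[:, np.asarray(mask).astype(bool)]
    overall_mean = selected.mean(axis=0)
    inter, intra = 0.0, 0.0
    for c in np.unique(labels):
        members = selected[labels == c]
        class_mean = members.mean(axis=0)
        inter += members.shape[0] * np.sum((class_mean - overall_mean) ** 2)
        intra += np.sum((members - class_mean) ** 2)
    return inter / (intra + 1e-12)   # small constant avoids division by zero

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = np.array([[0.0, 5.0], [0.1, -5.0], [1.0, 5.0], [1.1, -5.0]])
y = np.array([0, 0, 1, 1])
fit_good = variance_ratio_fitness(X, y, [1, 0])
fit_bad = variance_ratio_fitness(X, y, [0, 1])
```

A subset that separates the emotion classes well gets a high score, so an optimizer maximizing this value favours discriminative features.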

DCNN architecture
The DCNN captures the short-term and long-term correlation and connectivity in the local and global features extracted using MAF. It characterizes the changes in valence and arousal that emotion produces in the voice. It helps to represent the variations in intonation, prosody, and timbre of the speech caused by emotion and language, and it provides high-level abstract features that give better distinctiveness for different emotions. The proposed lightweight DCNN consists of three sequential CNN layers. Each CNN layer consists of a convolution layer (Conv), a Rectified Linear Unit layer (ReLU), and a maximum pooling layer (MaxPool). The DCNN accepts the handcrafted feature vector that represents the spectral, temporal, and voice quality features of the emotion signal. The first layer of the proposed DCNN includes three sub-layers {Conv1(KernelSize-1×3, NumFilter-64, Stride-1, ZeroPadding-Yes) → ReLU1(Stride-1) → MaxPool1(Stride-2)}; it accepts the multiple acoustic feature vector as an input with dimensions of (1×318) and produces a convolution feature map of (1×318×64). Zero padding maintains the original dimensions of the multiple acoustic features. The second layer is made up of {Conv2(KernelSize-1×3, NumFilter-128, Stride-1, ZeroPadding-Yes) → ReLU2(Stride-1) → MaxPool2(Stride-2)}. The third layer encompasses {Conv3(KernelSize-1×3, NumFilter-256, Stride-1, ZeroPadding-Yes) → ReLU3(Stride-1) → MaxPool3(Stride-2)}.
The convolution layer feature map of the 1-D acoustic features $A$ and the $L$ convolution filters is obtained using Eq. (21). Eq. (22) describes the convolution feature map, where $y_i^l$ stands for the $i$-th feature map of the $l$-th layer, $y_j^{l-1}$ denotes the $j$-th feature of the $(l-1)$-th layer, $k_{ij}^l$ represents the filter kernel of the $l$-th layer linked to the $j$-th feature, $b_i^l$ denotes the bias, and $\sigma$ symbolizes the ReLU activation function. The ReLU activation function is fast and simple; it replaces negative values with 0 to overcome the vanishing gradient issue, as given in Eq. (23).
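The Conv, ReLU, and MaxPool stages can be traced with a small NumPy sketch. It uses random, untrained weights purely to follow the tensor shapes, and it assumes a pooling window of 2 (the text gives only the pooling stride) and the usual deep-learning convention of convolution as sliding cross-correlation, $y_i^l = \sigma\bigl(\sum_j y_j^{l-1} * k_{ij}^l + b_i^l\bigr)$. The function names are hypothetical.

```python
import numpy as np

def conv1d_same(x, kernels, bias):
    """Zero-padded, stride-1 1-D convolution.

    x: (length, in_ch), kernels: (ksize, in_ch, out_ch), bias: (out_ch,)
    """
    ksize = kernels.shape[0]
    pad = ksize // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], kernels.shape[2]))
    for i in range(x.shape[0]):
        window = padded[i:i + ksize]                       # (ksize, in_ch)
        out[i] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool1d(x, size=2, stride=2):
    n_out = (x.shape[0] - size) // stride + 1
    return np.stack([x[i * stride:i * stride + size].max(axis=0) for i in range(n_out)])

rng = np.random.default_rng(0)
x = rng.standard_normal((318, 1))        # one 318-dimensional feature vector
shapes = []
for in_ch, out_ch in [(1, 64), (64, 128), (128, 256)]:
    k = rng.standard_normal((3, in_ch, out_ch)) * 0.1
    conv_out = relu(conv1d_same(x, k, np.zeros(out_ch)))
    x = max_pool1d(conv_out)
    shapes.append((conv_out.shape, x.shape))
```

Under these assumptions the first convolution keeps the length at 318 with 64 channels, and each pooling stage halves the length (318, 159, 79, 39).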
Here, $h_j$ denotes the penultimate layer's activations, $w_{ji}$ describes the weights joining the penultimate layer and the Softmax layer, $z_i$ represents the Softmax layer input, $p_i$ stands for the class label probability, and $\hat{y}$ indicates the label of the predicted class. The proposed DCNN is trained using the mini-batch gradient descent (MBGD) learning algorithm. In the MBGD algorithm, the complete dataset of $n$ samples is divided into small batches $b$, and the model weights are then updated using the model error given in Eq. (27).
The weights are updated using Eq. (28),
where $E$ provides the model error, $n$ stands for the training samples, $w$ denotes the weights of the filter kernel, $J$ provides the cost function, $\eta$ shows the learning rate, $\nabla$ describes the gradient of the cost function, and $n_b$ provides the number of batches.
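The Softmax output and the mini-batch weight update can be sketched for a single linear classification layer as follows. The batch size of 64 and learning rate of 0.01 are the values reported for the DCNN training; the cross-entropy cost, the toy data, and the reduction of the network to one layer are assumptions made only to keep the example short.

```python
import numpy as np

def softmax(z):
    """Row-wise Softmax: p_i = exp(z_i) / sum_j exp(z_j)."""
    z = z - z.max(axis=1, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, X, y):
    p = softmax(X @ W)
    return -np.log(p[np.arange(y.size), y] + 1e-12).mean()

def mbgd_epoch(W, X, y, lr=0.01, batch_size=64, rng=None):
    """One pass over the data, updating W after every mini-batch."""
    if rng is None:
        rng = np.random.default_rng(0)
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        p = softmax(X[idx] @ W)
        p[np.arange(idx.size), y[idx]] -= 1.0           # gradient of the cost w.r.t. z
        grad = X[idx].T @ p / idx.size                  # gradient w.r.t. W
        W = W - lr * grad                               # w <- w - eta * grad J
    return W

# Toy problem: 256 samples, 8 features, 4 classes (hypothetical data).
rng = np.random.default_rng(1)
X = rng.standard_normal((256, 8))
y = (X @ rng.standard_normal((8, 4))).argmax(axis=1)
W = np.zeros((8, 4))
loss_before = cross_entropy(W, X, y)
for _ in range(20):
    W = mbgd_epoch(W, X, y, rng=rng)
loss_after = cross_entropy(W, X, y)
probs = softmax(X @ W)
```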

RESULTS AND DISCUSSIONS
The proposed SER scheme is implemented in MATLAB 2019b on a personal computer with 16 GB RAM and a Core i7 processor running Windows.

Parameter configurations
The parameter configurations and initial hyper-parameters for the DCNN architecture are summarized in Table 4 and Table 5, respectively. FHO selects 200 features with higher inter-class and lower intra-class variance, which are provided to the DCNN for emotion recognition. Table 4 describes the input size for every layer, filter size, number of filters, stride value, padding, output activations, and total trainable parameters of the DCNN. The proposed DCNN requires 149120 trainable parameters, which leads to a lower training time (17.20 min). In contrast, the model needs 163460 trainable parameters when all features are fed to the DCNN.
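As a rough consistency check on these counts, the sketch below tallies the weights of three 1×3 convolution layers (64, 128, and 256 filters, each with a bias) followed by a fully connected layer to four emotion classes. The fully connected layer, pooling that halves the length with flooring, and the bias conventions are assumptions, since Table 4 is not reproduced here.

```python
def conv1d_params(kernel_size, in_ch, out_ch):
    """Weights plus one bias per filter."""
    return kernel_size * in_ch * out_ch + out_ch

def dcnn_params(num_features, num_classes=4, dense_bias=False):
    channels = [1, 64, 128, 256]
    total = 0
    length = num_features
    for in_ch, out_ch in zip(channels[:-1], channels[1:]):
        total += conv1d_params(3, in_ch, out_ch)
        length //= 2                       # stride-2 pooling, floor division assumed
    dense = length * channels[-1] * num_classes
    if dense_bias:
        dense += num_classes
    return total + dense

params_200 = dcnn_params(200)                        # 149120
params_318 = dcnn_params(318)                        # 163456
params_318_bias = dcnn_params(318, dense_bias=True)  # 163460
```

Under these assumptions the 200-feature count matches the reported 149120 when the final layer has no bias, while the 318-feature count matches the reported 163460 only when the four final-layer biases are included, so the two reported figures appear to follow slightly different counting conventions.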
Figures 5 and 6 depict the training accuracy and training loss, respectively, of the proposed DCNN for the MBGD algorithm. The MBGD algorithm is preferred over the ADAM, SGDM, and RMSProp algorithms because it is faster, reliable, and robust to variable data, provides better generalization during error updates, and minimizes the training duration by splitting the training data into batches of size 64. A moderate learning rate of 0.01 is selected to avoid under-fitting and over-fitting.

Evaluation metrics
The outcome of the suggested CCSER is estimated using various qualitative and quantitative metrics. Precision and recall provide the qualitative and quantitative measures, respectively, of the proposed CCSER for different training and testing scenarios. Accuracy gives the overall recognition rate, and F1-score provides the balance between precision and recall. The precision, recall, accuracy, and F1-score are computed using Eqs. (29)-(32). Here, TP, TN, FP, and FN represent the true positives, true negatives, false positives, and false negatives of the recognition result.
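The four metrics can be computed from a confusion matrix as sketched below, using the standard definitions precision = TP/(TP+FP), recall = TP/(TP+FN), and F1 = 2·precision·recall/(precision+recall). The row/column convention (rows are true classes, columns are predicted classes), the use of overall accuracy as correct predictions over all predictions, and the example matrix are assumptions for illustration.

```python
import numpy as np

def per_class_metrics(conf):
    """Return per-class precision, recall, F1 and the overall accuracy.

    conf[i, j] = number of samples of true class i predicted as class j.
    Assumes every class is predicted at least once (no zero denominators).
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp          # predicted as the class but belonging elsewhere
    fn = conf.sum(axis=1) - tp          # belonging to the class but predicted elsewhere
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / conf.sum()
    return precision, recall, f1, accuracy

# Hypothetical two-class confusion matrix.
precision, recall, f1, accuracy = per_class_metrics([[8, 2], [1, 9]])
```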

FHO based feature selection
FHO is run for different population sizes ranging from 10 to 318. FHO provides higher inter-class and lower intra-class variability for the 200 features selected from the multiple acoustic features. The behavior of the inter-class and intra-class variability of the features obtained using the FHO algorithm is shown in Figure 7. It is observed that MFCC, jitter, shimmer, ZCR, pitch frequency (PF), and some spectral kurtosis features consistently rank among the selected features, which shows that they are strongly influenced by emotion.

SER results and discussions
The outcome of the proposed multiple acoustic features (200 features selected using FHO) and DCNN is evaluated first for single-corpus SER, where the same corpus is used for training and testing, and then for cross-corpus SER. The cross-corpus SER evaluations are carried out by training the proposed system on one corpus and testing it on another. Figures 8-11 show the accuracy, precision, recall, and F1-score, respectively, for the cross-corpus SER for the Hindi, Urdu, Telugu, and Kannada languages. When the system is trained on Hindi and tested on Urdu, Telugu, and Kannada, it provides 78.50%, 51.00%, and 67.00% accuracy, respectively. When the system is trained on Urdu, the highest accuracy is obtained for Hindi (68.00%). It is observed that when an Indo-Aryan language is used for training, the other Indo-Aryan language gives a significantly better outcome than the Dravidian languages. Also, when the system is trained on the Kannada corpus, the Telugu corpus (64.00%) provides better accuracy than the Hindi (60.82%) and Urdu (62.00%) corpora. Similarly, when the system is trained on the Telugu corpus, the Kannada corpus (56.50%) provides better accuracy than the Hindi (52.00%) and Urdu (46.50%) corpora. The large differences in syllables, intonation, prosodic parameters, and pitch of speech lead to the deviation in the cross-corpus SER rate. The usefulness of the proposed CCSER is also assessed for multi-corpus training, as described in Table 7. For multi-corpus training, 70% of the samples from all four languages are used for training, and each language is tested independently on the trained model. When the corpora are mixed together, the distinctiveness of each corpus decreases due to language variability. The multi-corpus training results in 58.83%, 61.75%, 69.75%, and 45.51% accuracy for the Hindi, Urdu, Telugu, and Kannada languages, respectively. It is observed that multi-lingual training needs language adaptation to overcome the linguistic, regional, and intonation variations in Indian corpora.
The proposed system can be integrated with different HMI systems and social networking sites to understand and analyze emotions expressed in Indian languages. The system can also be useful for annotating emotions in movies and web series in a language unknown to the user. The system has shown significant achievement for CCSER, but its performance is limited by the small dataset, the larger time requirement, and the variance among regional languages in Indian corpora.

CONCLUSION
This paper presents cross-corpus SER using multiple acoustic features and a one-dimensional DCNN. The proposed CCSER is evaluated on four Indian-language datasets (Hindi, Urdu, Telugu, and Kannada) using accuracy, precision, recall, and F1-score as the performance metrics. It provides an accuracy of 58.83%, 61.75%, 69.75%, and 45.51% for the Hindi, Urdu, Telugu, and Kannada languages, respectively, for multi-lingual training. The FHO-based feature selection strategy selects the prominent features efficiently, which helps to obtain better SER accuracy and to minimize the trainable parameters. The proposed DCNN helps to boost the distinctiveness of the low-level features of the voice signal. The DCNN integrates global and local features, enhancing the differentiation between emotions and promoting generalization in SER. It is found that the proposed FHO-based selection of multiple acoustic features improves the CCSER outcome over schemes that take raw speech or spectrograms as the input. The proposed CCSER system helps to improve the generalization capability of traditional SER systems. It can enhance the emotional assistance given to a user interacting with a person or client on an online platform. In future, the outcomes of the suggested DCNN-based SER can be improved using domain adaptation and more global and local acoustic features.

Figure 5. Training accuracy of proposed model

Figure 6. Training loss of proposed model
Figure 7.

Table 1. Comparative analysis of various SER schemes

Table 2. Details about dataset used for CCSER
Figure 1. Flow diagram of proposed system

Table 3. Details regarding features

Table 4. Parameter specification of proposed DCNN

Table 5. CNN implementation initial parameters

Table 6. Outcome of proposed system for single corpus SER
Table 6 shows that the offered scheme provides better results for the single-corpus SER. It results in an accuracy of 90.52%, 84.00%, 90.60%, and 89.00% for the four Indian corpora, Hindi, Urdu, Telugu, and Kannada, respectively. It is observed that the single-corpus SER recognition shows less variance across the Indo-Aryan and Dravidian corpora.

Table 7. Outcome evaluation of proposed system for multilingual training