Facial Expression Recognition Using Data Augmentation and Transfer Learning

ABSTRACT


INTRODUCTION
Emotions strongly influence communication. Many applications benefit from recognizing facial emotions, including security monitoring, criminal justice systems, e-learning, smart card applications, customer satisfaction assessment, and social robots [1]. A conventional emotion recognition system consists of three main components: face detection, feature extraction, and emotion classification [2].
To address the problems with conventional techniques, deep learning networks employ an end-to-end learning process. The amount of data is critical in deep learning [3,4]. To improve deep learning performance, researchers increase the data size using normalization, translation, data augmentation, added noise, scaling, and cropping [5]. CNNs are among the best-proven methods for classification and segmentation tasks [6,7].
One of the main advantages of CNNs is automatic feature extraction [8].
Automated facial expression recognition (FER) remains an exciting and challenging problem in computer vision. Because people differ in how they show their expressions, FER is considered a complex problem for machine learning methods [9].
In this work, we propose a network architecture based on VGG16 with transfer learning for facial expression recognition. We used the FER2013 dataset to evaluate our design. The results demonstrate that the proposed method is effective for expression recognition on FER2013 images, improving facial expression analysis.
The paper is organized as follows: related work is detailed in Section 2, the methodology in Section 3, the proposed method in Section 4, and the conclusion in Section 5.

RELATED WORK
Various studies have addressed automatic FER. Singh and Nasoz [9] demonstrated FER classification on static images using CNNs without feature extraction or preprocessing. On the seven-class classification task, the authors achieved 61.7% accuracy on FER2013, while a state-of-the-art classifier achieved 75.2% accuracy.
Debnath et al. [10] proposed an FER system based on a four-layer convolutional neural network (CNN), called ConvNet, that recognizes seven facial emotions by fusing CNN features with local binary pattern (LBP) features and Oriented FAST and Rotated BRIEF (ORB) features. The method was evaluated on datasets including CK+ and JAFFE, achieving 98.13% and 92.05% accuracy, respectively.
Zahra et al. [11] presented a system design that recognizes and predicts facial emotion classes using real-time feature extraction with the OpenCV library together with Keras and TensorFlow. The Raspberry Pi-based design comprises three key processes: face detection, facial feature extraction, and facial expression classification. Using a CNN on FER-2013, the prediction accuracy for facial expressions was 65.97%.
Khan [12] used state-of-the-art ImageNet models and replaced the classification layer with SpinalNet and Progressive SpinalNet architectures to improve accuracy. Classification was performed on the FER2013 dataset, which is publicly available on Kaggle and contains over 35,000 face images covering seven different emotions. After training and hyperparameter fine-tuning, the final models with SpinalNet and Progressive SpinalNet surpassed all existing single stand-alone models reported on FER2013. VGG SpinalNet, one of the proposed designs, achieved the highest single-network accuracy of 74.45%.
Al-Asbaily and Bozed [13] used a system that combined the VGG16 model with a classic neural network. The VGG16 model was used for feature extraction and the classic neural network for classification on the FER2013 database; the system attained an accuracy of 89.31%.
The method suggested by Wang [14] involves continual adversarial training between the generator and discriminator of a Generative Adversarial Network to improve the extraction of visual features from the input set, which in turn yields high-accuracy facial expression recognition. For verification, the experimental section employs the CK+, JAFFE, and FER2013 datasets. The recognition approach shows clear advantages on datasets of various sizes, with average recognition accuracies of 95.6%, 96.6%, and 72.8%, respectively.

Dataset
Figure 1. FER2013 dataset expression distribution [15]
The data used for the model is the FER2013 database, obtained from the Kaggle FER2013 challenge [15]. The dataset is used to build the facial expression classification model. It comprises 35,887 images, divided into 28,709 training images and 3,589 public test images, with a further 3,589 images forming the final test set. Figure 1 depicts the FER2013 dataset expression distribution.

Median filter
Image filtering is a technique for reducing noise or artifacts, sharpening the contrast between adjacent regions, highlighting contours with a particular orientation, and detecting edges. It typically involves sliding a kernel (a square window) over the image. Median filtering is a nonlinear method of removing noise from images: each pixel is replaced by the median of the values in its neighborhood. It is widely used because it effectively reduces noise while preserving edges. Median filtering has been shown to be a dependable technique for eliminating impulsive noise without impairing edge features, and it is robust in the presence of high noise [16,17].
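As an illustration of the median filter described above, the following is a minimal pure-Python sketch. The 3×3 window size, the edge-replication border handling, and the function name `median_filter` are assumptions made for this example; the paper does not state which settings or library were used.

```python
from statistics import median


def median_filter(image, size=3):
    """Apply a size x size median filter to a 2D grayscale image.

    `image` is a list of rows. Border pixels are handled by replicating
    the nearest edge value (an assumption made for this sketch).
    """
    height, width = len(image), len(image[0])
    radius = size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), height - 1)  # clamp row index
                    xx = min(max(x + dx, 0), width - 1)   # clamp column index
                    window.append(image[yy][xx])
            out[y][x] = median(window)
    return out


# A flat patch with one impulsive ("salt") pixel.
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
filtered = median_filter(noisy)

# A vertical step edge, to check that the edge position is kept.
edge = [[0, 0, 100, 100] for _ in range(4)]
edge_filtered = median_filter(edge)
```

In the toy input the single bright pixel is replaced by the value of its neighbors, while the step edge in the second image stays where it was, which is the behavior the text attributes to median filtering.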

Image augmentation
Image data augmentation is a method that artificially increases the size of the training data by creating modified versions of the dataset images. More training data can result in more skillful deep learning neural networks, and augmentation provides variations of the images that help fitted models generalize what they have learned to new images. The ImageDataGenerator class in the Keras deep learning library supports fitting models with image data augmentation.
Deep learning algorithms such as CNNs can learn features that are independent of their position in the image. Nonetheless, augmentation can further support this transform-invariant learning, helping the model learn features that are also invariant to transformations such as left-to-right or top-to-bottom ordering, lighting levels, and more.
Data augmentation is a basic image preprocessing method that can be implemented online or offline. Offline augmentation is used to enlarge small datasets, whereas online augmentation is mainly used with large datasets. Online augmentation generates more training data from the original data while requiring no additional storage; in most cases, the generated images are produced in small batches that are discarded after model training. The common methods for producing new images are: (1) flipping vertically or horizontally; (2) translation; (3) random cropping; (4) scaling inward or outward; (5) rotation by some angle; and (6) adding Gaussian noise to avoid overfitting and improve learning capability [18,19]. An example of image augmentation is shown in Figure 2.
In the proposed system, we used vertical flip, shift, rotation, and zoom for data augmentation.
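To make the idea of online augmentation concrete, here is a small pure-Python sketch that produces randomly flipped and shifted copies of images batch by batch, without storing them. It covers only flips and shifts (not rotation or zoom), and all function names, the flip probability, and the shift range are hypothetical choices for illustration rather than the settings used in this work.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror a 2D image top to bottom."""
    return [row[:] for row in image[::-1]]


def shift(image, dx, dy, fill=0):
    """Translate the image by dx columns and dy rows, padding with `fill`."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                out[y][x] = image[src_y][src_x]
    return out


def random_augment(image, rng, max_shift=1):
    """Return one randomly flipped and shifted copy of `image`."""
    out = horizontal_flip(image) if rng.random() < 0.5 else image
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return shift(out, dx, dy)


def online_batches(images, batch_size, rng):
    """Yield augmented batches on the fly; nothing is written to disk."""
    for start in range(0, len(images), batch_size):
        yield [random_augment(img, rng) for img in images[start:start + batch_size]]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rng = random.Random(0)
batches = list(online_batches([image] * 5, batch_size=2, rng=rng))
```

Because each batch is generated when it is requested and then dropped, the augmented images never need to be kept alongside the original dataset.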

Transfer learning
Transfer learning is another useful technique for preventing overfitting [20]. It works by training a network on a large database such as ImageNet and then using the resulting weights as the initial weights in a new classification task. Typically, only the convolutional layer weights are copied rather than the complete network. This is highly useful because many image datasets share low-level spatial features that can be learned more effectively from massive data. Understanding the relationship between the source and target domains of the transferred knowledge is still a work in progress.
Transfer learning is divided into two approaches: fine-tuning and feature extraction. The pre-trained weights and layers of VGG16 are included in our model. We chose to freeze all pre-trained layers in this scenario, so the model extracts features with the frozen pre-trained layers and trains a fully connected layer for predictions [21,22].
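The feature extraction approach described above can be sketched in Keras roughly as follows. This is an illustrative sketch, not the authors' exact model: the input shape (48×48 images replicated to three channels), the layer types and sizes of the classifier head, and the function name `build_transfer_model` are all assumptions introduced here.

```python
NUM_CLASSES = 7            # seven emotion classes in FER2013
INPUT_SHAPE = (48, 48, 3)  # assumption: 48x48 faces replicated to 3 channels


def build_transfer_model(weights="imagenet"):
    """Frozen VGG16 convolutional base plus a small trainable classifier head."""
    # Imported lazily so the constants above can be inspected without TensorFlow.
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    # Load the convolutional layers only (no ImageNet classifier on top).
    base = VGG16(weights=weights, include_top=False, input_shape=INPUT_SHAPE)
    base.trainable = False  # feature extraction: pretrained weights stay fixed

    # Hypothetical head; the layer types and sizes are illustrative only.
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_transfer_model()` downloads the ImageNet weights on first use; only the layers added after the frozen base are updated during training.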

VGG16
A ConvNet, also known as a convolutional neural network (CNN), is a type of artificial neural network comprising an input layer, multiple hidden layers, and an output layer. VGG16 is a type of CNN. Its authors increased the network depth to 16-19 weight layers using small (3×3) convolution filters and evaluated the resulting networks, which achieved considerable improvement over prior-art configurations [23]. The 16-layer network has around 138 million trainable parameters. Figure 3 illustrates the VGG16 architecture.
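The size of VGG16 can be checked with a short calculation. The sketch below assumes the standard VGG16 configuration for ImageNet (224×224×3 input, thirteen 3×3 convolutional layers in five blocks each followed by 2×2 max pooling, and three fully connected layers of 4096, 4096, and 1000 units); the helper names are hypothetical.

```python
# Output channels of the 13 convolutional layers, grouped by block.
CONV_BLOCKS = [[64, 64], [128, 128], [256, 256, 256],
               [512, 512, 512], [512, 512, 512]]
FC_UNITS = [4096, 4096, 1000]  # fully connected layers of the ImageNet classifier


def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus one bias per filter for a kernel x kernel convolution."""
    return (kernel * kernel * in_channels + 1) * out_channels


def dense_params(n_in, n_out):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_in + 1) * n_out


def vgg16_parameter_counts(input_size=224, in_channels=3):
    """Return (convolutional parameters, fully connected parameters)."""
    conv_total, channels, size = 0, in_channels, input_size
    for block in CONV_BLOCKS:
        for out_channels in block:
            conv_total += conv_params(channels, out_channels)
            channels = out_channels
        size //= 2  # each block ends with 2x2 max pooling
    dense_total, n_in = 0, size * size * channels  # flattened feature map
    for n_out in FC_UNITS:
        dense_total += dense_params(n_in, n_out)
        n_in = n_out
    return conv_total, dense_total


conv_total, dense_total = vgg16_parameter_counts()
total = conv_total + dense_total
print(f"convolutional: {conv_total:,}  fully connected: {dense_total:,}  total: {total:,}")
```

Under these assumptions the convolutional layers account for about 14.7 million parameters and the fully connected layers for the rest, which is why most of the roughly 138 million parameters disappear when the original classifier is removed for transfer learning.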

Evaluation metrics
Specificity, F1 score, accuracy, sensitivity (recall), and precision are the performance metrics employed in this paper. These measures are defined in terms of the true-positive (TP), false-negative (FN), false-positive (FP), and true-negative (TN) counts [23][24][25].
In this work, we propose a method for FER based on VGG16 with transfer learning. The model consists of a preprocessing stage based on the median filter and data augmentation, followed by classification with a pretrained VGG16 used for transfer learning: four layers are added to the existing VGG16, which was already trained on the ImageNet dataset with its one thousand classes. We froze the existing layers and trained the new layers on the FER2013 dataset. A Keras ImageDataGenerator generates additional training data from the original data to prevent overfitting. This is done online, by looping over the data in small batches at every iteration of the optimizer. Several image parameters control how the artificial images are produced (e.g., shift, rotation, added Gaussian noise, and flipping). Figure 4 illustrates the design of the proposed method.

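In terms of the TP, TN, FP, and FN counts, accuracy (ACC) is computed as
$$\mathrm{Accuracy\ (ACC)} = \frac{TP + TN}{TP + TN + FP + FN}$$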
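The evaluation metrics named in this section all follow from the four counts TP, FP, FN, and TN. The sketch below uses their standard definitions; the function name `classification_metrics` and the example counts are made up for illustration and are not results from this paper. For a seven-class problem, such counts would typically be taken per class in a one-vs-rest fashion, which is an assumption about usage rather than something the paper specifies.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard metrics from the four confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Return 0.0 instead of dividing by zero for empty categories.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for a single class, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that accuracy can be high while recall stays low when the classes are imbalanced, which is one reason to report several metrics together.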
We used a Python Jupyter notebook to run our model on the dataset for analysis. The model was built on a Lenovo machine with an Intel Core i7 processor, 16 GB of RAM, and a 512 GB SSD.

RESULTS AND DISCUSSION
The dataset is divided into 70% training and 30% testing. The parameters are loss function 0.01, 100 epochs, and Adam as the optimizer. Because the VGG16 layers in this transfer learning model were pretrained on ImageNet, the model used for training and testing on FER2013 has 14,883,399 total parameters, with 13,146,119 trainable and 1,737,200 non-trainable parameters.
At the last epoch, the training loss is 0.48, the training accuracy is 96.41%, the testing loss is 1.3794, and the testing accuracy is 90%. Figure 4 demonstrates the proposed model's efficiency on the FER2013 database. Table 1 shows that the transfer learning-based model outperformed some well-known prior research papers on the same dataset.
This work is implemented in a Python Jupyter notebook using the Keras and TensorFlow libraries. The data consist of seven classes (sad, angry, surprise, happy, fear, neutral, and disgust). The dataset comprises 35,887 images distributed over the seven classes, as illustrated in Figure 1. The data are preprocessed using the median filter, and then data augmentation is applied. Data augmentation is performed on the training set to increase its size and avoid overfitting. The images are generated using five data augmentation methods: height shift (0.1), random rotation with a range of 10, zoom (0.1), width shift (0.1), and horizontal flip. Figure 5 illustrates the distribution of the data over the classes after data augmentation.
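The five augmentation settings listed above map naturally onto arguments of the Keras `ImageDataGenerator` class. The following is a sketch under stated assumptions: the rotation range is taken to be in degrees, the shift and zoom values are taken to be fractions, and the helper name `make_train_generator` and the batch size of 64 are hypothetical.

```python
# Augmentation settings quoted in the text above.
AUGMENTATION_CONFIG = {
    "rotation_range": 10,       # random rotation, assumed to be in degrees
    "width_shift_range": 0.1,   # fraction of the image width
    "height_shift_range": 0.1,  # fraction of the image height
    "zoom_range": 0.1,          # random zoom in the range [0.9, 1.1]
    "horizontal_flip": True,    # random left-right mirroring
}


def make_train_generator(x_train, y_train, batch_size=64):
    """Return an iterator that yields augmented (images, labels) batches.

    `x_train` is expected to be an array of shape (n, height, width, channels)
    and `y_train` the matching labels; the batch size is a hypothetical choice.
    """
    # Imported lazily so the settings can be inspected without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(**AUGMENTATION_CONFIG)
    return datagen.flow(x_train, y_train, batch_size=batch_size)
```

The returned iterator can be passed to `model.fit`, so augmented images are generated batch by batch during training and only for the training set.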
The weights are then initialized from VGG16 pretrained on ImageNet. The model used for training and testing on FER2013 has 14,883,399 total parameters, with 13,146,119 trainable and 1,737,200 non-trainable parameters. The summary of VGG16 is given in Table 1, and the summary of the resulting transfer learning model is given in Table 2. At the last epoch, the training loss is 0.48, the training accuracy is 96.41%, the testing loss is 1.3794, and the testing accuracy is 90%.
Table 3 shows that the transfer learning-based model outperformed some well-known prior research papers on the same dataset. Singh and Nasoz [9] achieved an accuracy of 75.2% with a CNN. Zahra et al. [11] used a CNN algorithm on a Raspberry Pi for FER prediction and achieved an accuracy of 65.97%. Saleh et al. [7] used VGG SpinalNet and achieved an accuracy of 74.45%. Wang [14] used a GAN and achieved 72.8% accuracy. Al-Asbaily and Bozed [13] used the VGG16 model for feature extraction and a classic neural network for classification, achieving 89.31% accuracy. Our method, in contrast, used data augmentation and transfer learning with VGG16 and achieved an accuracy of 90%.

CONCLUSIONS
FER systems are used in many applications, such as education and medical diagnosis. Transfer learning-based models allow knowledge to be transferred from one model to another to improve and accelerate system performance. We removed the top layers of VGG16 and placed our own layers above the remaining ones. The VGG16 model was already trained on ImageNet, which has one thousand classes; the new layers were trained on the FER2013 dataset after freezing the existing layers. The Keras ImageDataGenerator tool is used to generate more training data from the original data to avoid overfitting. Our proposed model achieved 90% accuracy, 60% recall, an F-measure of 63%, and 66% precision. In future work, the proposed model will be tested on different datasets and with different neural networks such as AlexNet.

Figure 5. The FER2013 dataset expression distribution after data augmentation

Table 1. Summary of VGG16

Table 2. Summary of the transfer learning model