Android Malware Classification Using Gain Ratio and Ensembled Machine Learning



INTRODUCTION
Mobile phone use has grown in recent years due to user-friendly design and multifunctional capabilities, making these devices an essential asset in people's daily lives [1]. One of the mobile phone operating systems is Android, which was released in 2008 [2]. Android has become the most used operating system on mobile devices, holding 70.5% of the total market share as of the third quarter of 2023 [3]. The large number of users makes the Android operating system a target of cyber attacks. In the middle of 2020, 10.6 million Android malware samples were found, and this number is expected to increase because of the various cases of cybercrime on mobile devices [4]. Besides, malware is constantly evolving, making it very challenging to detect. The rapid evolution of malware poses a significant threat to individual, commercial, and digital security [5].
Hackers employ reverse engineering techniques to modify and repackage harmless applications by incorporating their malicious code [6]. Malware, or malicious code, is often embedded in Android apps to gain access to the user's device and retrieve personal data [7]. Thousands of malware samples can infiltrate Google Play, the most trusted Android software download and installation service provider [8]. Besides, downloading applications from unknown sources increases the risk of virus and malware infiltration. Children are at a higher risk of being tricked since many harmful programs appear to be safe applications with positive reviews [9]. Thus, downloading applications from trusted sources, frequently updating software, and ensuring that security features are enabled can help protect Android devices from malware [10].
The rise in malware in Android apps presents a significant challenge. One way to stop the spread of malware is to identify and categorize its types. Previous researchers have introduced various ways to identify and categorize Android malware. Machine learning is the most popular technique for identifying and categorizing Android malware [2,11,12]. Prior research has shown that machine learning and deep learning are reliable enough to classify Android malware into five categories [13].
This research focused on feature selection and ensembled classification with five machine learning algorithms: Random Forest (RF), Extra Tree (ET), Naive Bayes (NB), k-Nearest Neighbors (k-NN), and Support Vector Machine (SVM). The dataset used is CICMalDroid2020, which consists of more than 11 thousand samples and 471 features. After preprocessing, the model classified the samples into five categories: Adware, Banking Malware, SMS Malware, Riskware, and Benign. The experiment results show that using different feature selection and classification techniques on large datasets significantly affects the detection performance.
The contributions of this study to Android malware classification are as follows:
(1) This research focused on classifying five malware categories with ensemble machine learning classification and gain ratio feature selection on the Android malware dataset, CICMalDroid2020.
(2) After preprocessing, the data was used to train different machine learning models: Random Forest (RF), Extra Tree (ET), Naive Bayes (NB), k-Nearest Neighbors (k-NN), and Support Vector Machine (SVM). The outputs of these models were then combined using the ensembled method.
(3) This research analyzed the impact of using the gain ratio and the ensembled method.
Although previous researchers have proposed numerous malware detection and classification models, studies on Android malware detection that focus on ensembled classification and feature selection using the gain ratio are hard to find. This research studies the impact of the gain ratio on each machine learning technique. It also examines the impact of ensemble classification on Android malware detection models.
The paper is divided into the following sections: Section 2 presents relevant research, Section 3 describes the feature selection procedure and the proposed ensemble machine learning model, Section 4 details the experiment's findings, and Section 5 outlines the conclusions.

RELATED WORKS
Previous research has focused on developing methods for Android malware analysis. This section discusses previous research that relates to this work. Selvaganapathy et al. [14] conducted a literature review and summarized the possible attacks and defenses, including methods and future challenges for building effective Android malware detection and classification. Android malware analysis can be divided into three categories: static analysis, dynamic analysis, and hybrid analysis [15].

Static analysis
Static analysis detects malicious software by analyzing the application manifest without executing the application [16,17]. Raghuvanshi et al. [18] applied machine learning approaches to the CICAndMal2017 dataset to distinguish secure applications from malware. This study achieved an accuracy of 96.27% with the Random Forest algorithm. On the other hand, Amenova et al. [19] used deep learning algorithms to detect Android malware. A convolutional neural network (CNN) effectively extracts features from the input data, and incorporating supplementary LSTM layers enhances the accuracy of predictions. Experiments using the CICMalDroid2020 dataset show reliable prediction results, with 94% accuracy and only a 3% false positive rate.

Dynamic analysis
Dynamic analysis examines the malware by collecting memory, process, and traffic information while running the application in a sandbox or designated environment [16,17]. Islam et al. [20] presented a dynamic analysis technique. This study showed the impact of outlier handling on a complex malware dataset. The final prediction combines the outputs of all trained models, including RF, k-NN, MLP, DT, SVM, and LR, whose R² score was more than 0.85.

Hybrid analysis
An analysis that blends static and dynamic elements is known as a hybrid analysis. Taheri et al. [21] presented a two-layer Android malware analysis: Static Binary Classification (SBC) and Dynamic Malware Classification (DMC). The first layer used Permission and Intent features to identify malware, while the second layer used API calls to classify the malware samples from the first layer into four categories and 39 families with a Random Forest algorithm. Using this method, the model achieved a recall of 61.2%, an increase of 35.7% compared to previous research.

Feature selection
In machine and deep learning, feature selection can improve model performance. Many researchers have proposed feature selection techniques for this purpose. Chakravarty et al. [22] compared three feature selection methods: Gain Ratio, Information Gain, and Relief. The proposed method applied feature selection to four different classification algorithms, and the results showed that the gain ratio obtained higher performance in most classification algorithms and achieved 94.47% accuracy. Therefore, this study uses the gain ratio because the method gave reliable performance in previous studies.

Ensemble methods
Ensemble methods are commonly implemented to enhance the precision of malware identification and categorization. This technique combines multiple models and determines the weight of each model's output. This method is often called a voting classifier. Islam et al. [20] presented an effective ensemble machine learning model. The study showed that the weighted voting ensemble model performs better than the individual models. However, that research did not implement the gain ratio as the feature selection method. In contrast, our research focuses on analyzing the impact of both the gain ratio and ensemble learning.

PROPOSED APPROACH
This section discusses the proposed approach for classifying Android malware. Figure 1 illustrates the experimental design of this research. There are three main stages in this study: Data Preprocessing, which consists of Data Scaling and Feature Selection; Classification; and Ensemble Voting. The feature selection process eliminated more than 60% of the data columns using the gain ratio technique. The classification resulted in predictions of five categories: Adware, Banking Malware, SMS Malware, Riskware, and Benign.

Data preprocessing
The dataset typically includes values with dissimilar units in each column and irrelevant features. These factors can adversely affect the performance of the machine learning model. Thus, data preprocessing is needed to improve the classification performance. This paper used Standard Scaling and Gain Ratio to preprocess the data.

Data scaling
Data scaling is one of the most crucial steps in data preprocessing before building a machine learning model. The scaling technique used in this research is standardization, implemented with a standard scaler. Unlike min-max normalization, which binds the values to a fixed range such as [0, 1] or [-1, 1], standardization centers and rescales each feature. A standard scaler is a linear scaler that is very useful for accelerating algorithms using gradient descent [23]. The goal of the standard scaler method is to transform each feature so that it has a mean of zero and a standard deviation of one, as in Eq. (1).

$$z = \frac{x - \mu_x}{\sigma_x} \quad (1)$$

where $\mu_x$ is the mean and $\sigma_x$ is the standard deviation of feature $x$. Standardization (or z-score normalization) is a common technique in statistics and data analysis. It transforms the values of a variable so that they have a mean of 0 and a standard deviation of 1. The process subtracts the mean ($\mu_x$) from each data point ($x$) to center the distribution around 0. The result is then divided by the standard deviation ($\sigma_x$) to scale the values, ensuring that they have a consistent unit of measurement.
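The standardization in Eq. (1) can be illustrated with a minimal sketch using only the Python standard library. The function name `standardize` and the sample column are hypothetical and chosen for illustration; the experiment itself relied on a library scaler, so this is not the authors' code.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return the z-scores (x - mean) / standard deviation of one feature column."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column has no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical feature column, e.g. how often some system call occurs per app.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(column)

print(z_scores)
print(fmean(z_scores), pstdev(z_scores))
```

After the transformation the column has a mean of 0 and a standard deviation of 1, which is the property Eq. (1) is meant to guarantee.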

Gain ratio
This phase focused on reducing the features of the training data. The feature selection process involves identifying and removing less significant features from a dataset to decrease the complexity of the machine learning model and increase its accuracy. Previous research on malware detection has shown that the gain ratio performs better than other feature selection methods [22]. Thus, this research also uses the gain ratio as the feature selection method in the Android malware detection model. The gain ratio attempts to reduce the bias of information gain by normalizing Information Gain with Information Entropy [24]. For X and Y, Information Gain can be calculated as:

$$IG(X; Y) = H(X) - H(X \mid Y) \quad (2)$$

$$H(X) = -\sum_{x} p(x) \log p(x), \qquad H(X \mid Y) = -\sum_{y} p(y) \sum_{x} p(x \mid y) \log p(x \mid y) \quad (3)$$

The gain ratio of X compared to Y equals the ratio of the information gain to the information entropy, as expressed in Eq. (4) [24].

$$GR(X; Y) = \frac{IG(X; Y)}{H(X)} \quad (4)$$

where p(x) and p(y) represent the probabilities of x and of class y, while p(x|y) is the probability that data x belongs to class y. The gain ratio is thus the ratio between the mutual information of two random variables and the entropy of one of them. Therefore, GR(X; Y) ranges from 0 to 1. A value of 1 denotes that one variable completely determines the other, while 0 signifies complete independence between X and Y [25]. Figure 2 shows the ten features with the highest gain ratio scores. After feature selection, the dataset was divided into 80% training data (Xtrain, Ytrain) and 20% testing data (Xtest, Ytest).
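As a rough illustration of the gain ratio described above, the sketch below computes the information gain between a discrete feature column and the class labels and normalizes it by the entropy of the feature. Several points are assumptions made for illustration: the feature values are treated as discrete, base-2 logarithms are used (the ratio does not depend on the base), and the feature names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(V), in bits, of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def conditional_entropy(x, y):
    """H(X|Y): entropy of x within each group of equal y, weighted by p(y)."""
    total = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def gain_ratio(x, y):
    """Information gain H(X) - H(X|Y) divided by H(X); 0.0 for a constant x."""
    h_x = entropy(x)
    if h_x == 0:
        return 0.0
    return (h_x - conditional_entropy(x, y)) / h_x


# Toy data: four apps, two hypothetical binary features, two classes.
labels = ["Adware", "Adware", "Benign", "Benign"]
features = {
    "sends_sms": [1, 1, 0, 0],    # perfectly aligned with the label
    "uses_camera": [1, 0, 1, 0],  # independent of the label
}
scores = {name: gain_ratio(col, labels) for name, col in features.items()}

# Keep only features whose gain ratio is above zero.
selected = [name for name, score in scores.items() if score > 0]
print(scores, selected)
```

Dropping every feature with a score of zero mirrors the selection rule reported in the experiment, although the threshold and the handling of continuous features in the actual pipeline may differ.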

Machine learning and ensemble classification
This research aimed to classify five malware categories using five machine learning methods: Random Forest (RF), Extra Tree (ET), Naive Bayes (NB), k-Nearest Neighbors (k-NN), and Support Vector Machine (SVM), as shown in Figure 1. Random Forest (RF) is an ensemble classifier technique that generates multiple decision trees by randomly selecting subsets of training samples and features [26]. Like Random Forest, the Extra Tree method combines the results of multiple decision trees for classification predictions [27]. k-Nearest Neighbors (k-NN) classifies a new data point by identifying its k nearest neighbors and assigning it to the majority class of those neighbors [28]. On the other hand, SVM aims to find the maximum margin hyperplane, the decision boundary that best separates different classes in the training data [29]. The last is Naive Bayes, which utilizes probability theory and operates on the assumption that the features considered are independent [30]. RF and ET are examples of tree-based algorithms, while NB, k-NN, and SVM are non-tree-based algorithms. This research uses five different machine learning models to analyze the impact of the gain ratio feature selection on each algorithm. To implement the machine learning algorithms, this experiment utilizes the scikit-learn library, and the hyperparameters follow the default values of the library. The last process combined all the machine learning detection results with an ensemble classification approach.
Ensemble classification is a methodology that combines multiple machine learning models rather than relying on a single algorithm. Ensemble algorithms fall under supervised learning, as they can be trained on labeled data and used for making predictions. Combining multiple models in an ensemble represents a collective hypothesis that aims to provide a more robust and accurate prediction than individual models acting alone [31,32]. This study implemented the ensemble method with a hard voting approach. The class predicted by the majority of the models determines the result of the hard voting classification. For example, in a scenario where RF, SVM, and k-NN predict Riskware, while NB and ET predict Adware, the hard voting result is Riskware as the majority output. This method was applied to combinations of 3 and 5 machine learning models, shown in Table 1.
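The hard voting rule can be sketched in a few lines of plain Python. The helper `hard_vote` is a hypothetical name used for illustration and is not the implementation used in the experiment; the votes reproduce the example given above.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the largest number of models for one sample."""
    return Counter(predictions).most_common(1)[0][0]


# Per-model predictions for a single app, as in the example above.
votes = {
    "RF": "Riskware",
    "SVM": "Riskware",
    "k-NN": "Riskware",
    "NB": "Adware",
    "ET": "Adware",
}
result = hard_vote(list(votes.values()))
print(result)
```

With five classes, ties are still possible even for an odd number of voters (for example, two models per class plus one outlier). This sketch resolves a tie in favor of the label that was counted first, which is an arbitrary choice; a library implementation may break ties differently.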

EXPERIMENT AND RESULTS
This section explains the experimental results and analyzes the impact of using the gain ratio and a voting classifier for predicting five malware categories. This experiment was run on Google Colab and implemented in Python, and the ensemble mechanism used in the proposed model is shown in Figure 3. To conduct this experiment, the CICMalDroid2020 dataset was split into two parts, with 80% for training data and 20% for testing data. The performance of the classification models is analyzed using standard evaluation metrics: Accuracy, Precision, Recall, and F1-Score.
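To make the four evaluation metrics concrete, the following sketch computes them from true and predicted labels. The paper does not state how precision, recall, and F1-Score are averaged over the five classes, so the macro average (the unweighted mean over classes) used here is an assumption, as are the function name `evaluate` and the toy labels.

```python
def evaluate(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


# Toy labels for six apps (hypothetical values, not experiment data).
y_true = ["Adware", "Adware", "Benign", "Benign", "Riskware", "Riskware"]
y_pred = ["Adware", "Benign", "Benign", "Benign", "Riskware", "Adware"]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

On imbalanced data such as this dataset, a weighted average over classes would give somewhat different precision, recall, and F1 values than the macro average.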

Dataset
CICMalDroid2020 is a public dataset collected from 17,341 Android samples from numerous sources such as VirusTotal, AMD, and MalDozer. The data falls into three big groups: statistical information, dynamically observed behaviors, and network traffic [23,33]. The dataset used here consists of 471 features and 11,598 samples, divided into five categories: 1) Adware with 1,253 samples; 2) Banking Malware with 2,100 samples; 3) SMS Malware with 3,904 samples; 4) Riskware with 2,546 samples; and 5) Benign with 1,795 samples [23]. The distribution of each category and an example of the data can be seen in Figures 4 and 5, respectively.
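The class counts above can be checked with a short calculation that also shows how imbalanced the categories are. The 80/20 split sizes are obtained here by plain rounding, which is an assumption; the exact sizes depend on how the splitting routine rounds.

```python
# Class counts as reported for the dataset.
class_counts = {
    "Adware": 1253,
    "Banking Malware": 2100,
    "SMS Malware": 3904,
    "Riskware": 2546,
    "Benign": 1795,
}

total = sum(class_counts.values())
shares = {name: round(100 * n / total, 2) for name, n in class_counts.items()}

# Approximate sizes of an 80/20 train/test split (assumes simple rounding).
test_size = round(0.2 * total)
train_size = total - test_size

largest = max(class_counts, key=class_counts.get)
smallest = min(class_counts, key=class_counts.get)
imbalance_ratio = class_counts[largest] / class_counts[smallest]

print(total, shares)
print(train_size, test_size, largest, smallest, round(imbalance_ratio, 2))
```

The counts sum to the 11,598 samples stated above, and the largest class (SMS Malware) is roughly three times the size of the smallest (Adware).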

Performance results on single machine learning with gain ratio
A large number of unimportant features increases data redundancy and the probability of overfitting. Therefore, feature selection is highly recommended, as it significantly affects model training. According to Mahdavifar et al. [23], a gain ratio score of zero means that a feature has no influence on identifying the malware class. In this experiment, the 64% of features with a gain ratio score of zero were removed, leaving 171 out of 470 features for the machine learning models. The ten features with the highest gain ratio scores are shown in Figure 2. Feature selection using Gain Ratio improved the performance of the single machine learning models. Figures 6-9 show that Random Forest (RF) and Extra Tree (ET) had the highest performance. However, the impact of the gain ratio can be seen most clearly in Naive Bayes (NB), k-Nearest Neighbors (k-NN), and Support Vector Machine (SVM). For instance, SVM had 79.05% accuracy without the gain ratio, whereas with the gain ratio applied, it reached 81.34% accuracy (see Figure 6). This is an increase of 2.29 percentage points, whereas the accuracy of ET and RF changed only slightly with the gain ratio applied.

Performance results on ensemble machine learning
This section discusses the ensemble hard voting classification. Based on Table 2, the best single model with gain ratio slightly outperformed the ensemble classification. Among the voting combinations, the ensemble of RF, ET, and k-NN, the three single models with the highest performance, showed the highest accuracy at 94.57%. Other RF and ET-based voting combinations also resulted in competitive performance compared to the remaining combinations, namely over 94.00% in accuracy, precision, recall, and F1-Score. In addition, the five-model voting classification obtained an average precision of 92.50%. The results of the single and ensemble methods differ because underperforming models, such as Naive Bayes, influence the vote and decrease the performance of the ensemble.
The combination of RF, ET, and k-NN also shows the best detection accuracy for each label. Based on Figure 10, the accuracy of the Adware, Banking, SMS malware, Riskware, and Benign labels is 97.54%, 97.63%, 98.92%, 97.24%, and 97.80%, respectively. Among these, the SMS malware label has the highest accuracy. The model likely performs best at detecting SMS malware because this label has the largest number of samples compared to the other labels.
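The per-label accuracies are higher than the overall accuracy, which is what one would expect if each label is scored one-vs-rest from the confusion matrix, counting both true positives and true negatives as correct. That reading is an assumption, since the paper does not give the formula. The sketch below illustrates it with a small hypothetical three-class confusion matrix (rows are true labels, columns are predicted labels); the numbers are not taken from Figure 10.

```python
def per_class_accuracy(matrix, labels):
    """One-vs-rest accuracy (TP + TN) / N for each class of a confusion matrix."""
    total = sum(sum(row) for row in matrix)
    result = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # true class i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp  # predicted class i, true otherwise
        tn = total - tp - fn - fp
        result[label] = (tp + tn) / total
    return result


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Hypothetical confusion matrix for illustration only.
labels = ["Adware", "Riskware", "Benign"]
matrix = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
per_label = per_class_accuracy(matrix, labels)
overall = overall_accuracy(matrix)
print(per_label, overall)
```

In this toy case every per-label accuracy exceeds the overall accuracy, because the true negatives of the other classes count as correct for each label.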

Comparative analysis
This research compared the proposed method with the model by Nguyen et al. [13], as shown in Table 3. The proposed method using gain ratio and ensemble machine learning classification still performs below Nguyen et al. [13]. The result suggests that gain ratio feature selection has a small impact on performance compared to Extremely Randomized Trees [13]. The gain ratio calculates the probability of each attribute in the dataset, which makes it less suitable for a dataset with a huge number of attributes. Besides, the Extremely Randomized Trees method samples from the entire dataset while constructing the trees. Different subsets of the data may introduce different biases in the results obtained. Hence, Extra Trees reduces data bias by sampling the entire dataset, so this method is more suitable for datasets with many attributes.

CONCLUSIONS
Android malware detection is crucial for increasing Android security. This research proposed a method that combines the gain ratio and ensemble machine learning with five machine learning models: RF, ET, k-NN, SVM, and NB. The goal is to classify applications into five categories: Adware, Banking Malware, SMS Malware, Riskware, and Benign. The experiment results show that the gain ratio increased the accuracy of the NB method by 2.59 percentage points, k-NN by 0.90, and SVM by 2.29. RF and ET changed only slightly after the gain ratio was applied, likely because these tree-based algorithms already rank features with an impurity criterion when choosing splits. The combination of RF, ET, and k-NN with an ensemble voting classifier achieved the highest ensemble performance, with an accuracy of 94.57% and a precision of 94.71%. This accuracy was slightly lower than the highest accuracy of the single RF method, which reached 94.66%. However, the ensemble voting classifier has better precision than the single RF method, which reached 94.66%.
The experiment uses the default values set by the scikit-learn library, which means there was no dedicated hyperparameter tuning phase. Consequently, the performance does not surpass previous research with similar models or datasets, which reached 97.07% with RF and 97.67% with ET. Future research should analyze combinations of machine learning or deep learning methods to increase ensembled classification performance. The data preprocessing stage, hyperparameter tuning, and outlier handling also need to be addressed. Nevertheless, application stores could use the proposed method for threat mitigation to reduce the likelihood of malicious applications reaching users through official channels.

Figure 1. Proposed approach for Android malware classification

Figure 2. Feature selection using gain ratio

Figure 6. Model accuracy before and after gain ratio

Figure 7. Model precision before and after gain ratio

Figure 8. Model recall before and after gain ratio

Figure 9. Model F1-score before and after gain ratio

Figure 10. Confusion matrix of RF, ET, k-NN with gain ratio

Table 2. Ensemble voting classification result

Table 3. Model comparison