Intrusion Detection Models Using Supervised and Unsupervised Algorithms - A Comparative Estimation

Aswadati Sirisha^*| Kosaraju Chaitanya | Komanduri Venkata Sesha Sai Rama Krishna | Satya Sandeep Kanumalli

Department of IT, Vignan's Institute of Information and Technology, Duvvada 530046, Andhra Pradesh, India

Department of CSE, Vignan’s Nirula Institute of Technology & Science for Women, Peda Palakaluru, Guntur 522009, Andhra Pradesh, India

Corresponding Author Email: sirishavignan1@gmail.com

Received: 31 July 2020

Revised: 23 December 2020

Accepted: 3 January 2021

Available online: 28 February 2021

OPEN ACCESS

Abstract:

An Intrusion Detection System is a protection mechanism that monitors a network and identifies inappropriate behaviour. Several computational methods for identifying network intrusions have been proposed, but the existing mechanisms are not adequate to cope with network security threats, which grow exponentially with Internet use. Class imbalance is one of the issues with the available datasets. This paper presents the implementation and study of different machine learning algorithms for classification and anomaly detection in network-based intrusion detection. NSL-KDD and CICIDS, which serve as a nearly balanced and an imbalanced dataset respectively, are used as benchmarks for the assessment. A Random Forest classifier is used for feature selection. The algorithms studied are logistic regression, decision trees, random forest, naive Bayes, K-nearest neighbors, K-means, isolation forest, and local outlier factor, covering both supervised and unsupervised learning. The results show that Random Forest outperforms the other supervised approaches, while K-Means performs better than the other unsupervised approaches.

Keywords:

data balancing, intrusion detection, machine learning, supervised learning, unsupervised learning

1. Introduction

An Intrusion Detection System (IDS) is a protection framework that tracks network activities to verify that network operation is normal. Depending on the severity of what it detects, appropriate steps are then taken. In machine learning based detection, IDS are classified as misuse-based and anomaly-based. A misuse-based IDS learns patterns of known misuse from the data it is trained on, whereas an anomaly-based IDS detects behaviour that deviates from standard network behaviour. A signature- or misuse-based IDS detects known attacks only, but an anomaly-based IDS can detect new attacks that were not seen during modelling. In this article, the machine learning methods used are logistic regression, decision trees, random forest, Naïve Bayes, K-nearest neighbors, K-means, isolation forest, and local outlier factor.

2. Comparative Study

This paper compares the following algorithms.

2.1 Logistic regression

Logistic regression is a classification model that uses a logistic function to predict the probabilities of events from the data fitted to it. It uses a sigmoid function to map predicted values to probabilities. The logistic function used by this model is represented by Eq. (1):

$\log \left[\frac{p(x)}{1-p(x)}\right]=\beta_{0}+x \beta$ (1)

To predict the class a sample belongs to, this method uses a threshold value: a sample whose predicted probability is greater than the threshold is assigned to the positive class, and otherwise to the negative class.
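The relationship between Eq. (1), the sigmoid, and the threshold can be illustrated with a minimal sketch. The coefficient values and the threshold of 0.5 are assumptions chosen for illustration; they are not fitted to NSL-KDD or CICIDS.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, beta0, beta, threshold=0.5):
    """Return (probability, label) for feature vector x."""
    # Linear predictor beta0 + x . beta, i.e. the log-odds of Eq. (1).
    z = beta0 + sum(xi * bi for xi, bi in zip(x, beta))
    p = sigmoid(z)
    # Label 1 when the probability exceeds the threshold, else 0.
    return p, int(p > threshold)

# Illustrative coefficients only (assumed, not learned from data).
p, label = predict_class([2.0, 1.0], beta0=-1.0, beta=[0.5, 0.25])
print(p, label)
```

Taking the log-odds of the returned probability recovers the linear predictor, which is the left-hand side of Eq. (1).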

2.2 Random forest

This paper uses the Random Forest algorithm for classification. It builds a set of N decision trees, each trained on a random subset of k data samples. For a new sample, each of the N trees predicts the category to which the data point belongs, and the data point is assigned to the category that wins the majority vote. It is an ensemble learning method, in which a strong learner is created from a set of weak learners.

2.3 Decision trees

This paper uses decision trees for classification. Decision trees split the data using if-then-else conditions on the features. The core components of a decision tree are the decision nodes, the branches, and the leaf nodes. Classification begins at the root decision node, which tests the feature assigned to that node; the sample then moves down the branch that matches its feature value, and the test is repeated at each decision node until a leaf is reached. For attribute selection at each decision node, the tree uses a criterion such as information gain (based on entropy) or the Gini index.
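The two attribute-selection criteria named above can be sketched in a few lines. The label lists are a toy example invented for illustration, and the functions assume non-empty inputs.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy split: 8 samples, evenly mixed before the split.
parent = ["normal"] * 4 + ["attack"] * 4
left = ["normal"] * 3 + ["attack"]
right = ["normal"] + ["attack"] * 3
print(entropy(parent), gini(parent), information_gain(parent, left, right))
```

A decision node would evaluate such a score for each candidate split and keep the split with the highest information gain (or the lowest weighted Gini index).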

2.4 Naïve bayes

The naive Bayes method is based on applying Bayes' theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. We use the classification rule in Eq. (2):

$\hat{y}=\operatorname{argmax}_{y} p(y) \prod_{i=1}^{n} p\left(x_{i} \mid y\right)$ (2)

The different naive Bayes classifiers differ in the distribution assumed for $p(x_i \mid y)$.

According to the Gaussian Naïve Bayes, the likelihood of the features is given by Eq. (3):

$p\left(x_{i} \mid y\right)=\frac{1}{\sqrt{\left(2 \pi \sigma_{y}^{2}\right)}} \exp \left(-\frac{\left(x_{i}-\mu_{y}\right)^{2}}{2 \sigma_{y}^{2}}\right)$ (3)
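Eqs. (2) and (3) can be combined into a short sketch. The priors, means, and variances below are assumed values for illustration; the sketch keeps one mean and one variance per feature and class, and works in log space to avoid numerical underflow when many likelihoods are multiplied.

```python
import math

def log_gaussian(x, mean, var):
    """Logarithm of the Gaussian likelihood in Eq. (3)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def predict(x, priors, means, variances):
    """Return argmax_y p(y) * prod_i p(x_i | y), as in Eq. (2)."""
    best_label, best_score = None, -math.inf
    for y in priors:
        # Sum of logs equals the log of the product in Eq. (2).
        score = math.log(priors[y])
        for xi, mu, var in zip(x, means[y], variances[y]):
            score += log_gaussian(xi, mu, var)
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# Assumed parameters for two classes (0 = normal, 1 = attack), two features.
priors = {0: 0.5, 1: 0.5}
means = {0: [0.0, 0.0], 1: [3.0, 3.0]}
variances = {0: [1.0, 1.0], 1: [1.0, 1.0]}
print(predict([2.5, 2.8], priors, means, variances))
```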

2.5 K-nearest neighbors

Each time a new sample is to be classified, this method finds the k training instances that are nearest to it. The k nearest neighbors can be computed using a distance measure such as the Hamming, Minkowski, Euclidean, or Manhattan distance.
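As a minimal sketch of the distance measures and the voting step: the Minkowski distance with p = 1 is the Manhattan distance and with p = 2 the Euclidean distance. The training points and labels are toy values assumed for illustration.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=3, p=2):
    """Majority label among the k training points nearest to `query`."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data: label 0 near the origin, label 1 near (5, 5).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
train_y = [0, 0, 0, 1, 1]
print(knn_predict(train_X, train_y, (0.5, 0.5)))
print(knn_predict(train_X, train_y, (5.5, 5.0)))
```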

2.6 K-means

K-means is an unsupervised learning method that uses iterative calculations to divide the dataset into K distinct clusters, where each data point belongs to only one cluster. It first chooses the number of clusters k and k initial centroids, and then assigns each data point to the closest centroid. It then recomputes the centroid of each cluster, reassigns each data point to the nearest centroid, and repeats this process until convergence.
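The assign-and-update loop can be sketched as follows. To keep the example deterministic, the initial centroids are passed in explicitly, which is an assumption of this sketch; library implementations usually pick them from a random seed. The 2-D points are toy values.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's iteration on 2-D points with caller-supplied initial centroids."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(c)  # keep the centroid of an empty cluster
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Starting from different initial centroids can lead to a different final partition, which is why results obtained with K-means may vary with the random seed.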

2.7 Isolation forest

Isolation forest, also called iForest, is an unsupervised learning algorithm that isolates anomalies, which are 'few and different' in the feature space compared to normal data points. iForest separates the samples by randomly choosing an attribute and then choosing a split value between the minimum and maximum values of that attribute. The number of splits needed to isolate a point is its path length. Points for which the random partitioning in the trees of the forest produces shorter paths are considered anomalies.
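The idea that anomalies have shorter paths can be illustrated with a simplified sketch. It uses a single 1-D attribute, no subsampling, and a fixed seed, all of which are simplifications assumed here; a full isolation forest also picks a random attribute at every split. The data values are invented for illustration.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=50):
    """Number of random splits needed to isolate `point` within 1-D `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains the point.
    if point < split:
        side = [v for v in data if v < split]
    else:
        side = [v for v in data if v >= split]
    return path_length(point, side, rng, depth + 1, max_depth)

def mean_path_length(point, data, n_trees=200, seed=0):
    """Average path length of `point` over many random partitionings."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

# Seven clustered values and one value far away from them.
data = [10.0, 10.2, 10.4, 10.6, 10.8, 11.0, 11.2, 25.0]
print(mean_path_length(25.0, data), mean_path_length(10.6, data))
```

The isolated value at 25.0 is usually separated by the first split, whereas a value inside the cluster needs several splits, so its average path is longer.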

2.8 Local outlier factor

Local outlier factor is an unsupervised anomaly detection method that computes the local density of each point based on its nearest neighbors. It compares the local density of each data point with the local densities of its neighbors and identifies as outliers the points whose density is substantially lower.
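A brute-force sketch of the density comparison is given below. It assumes 1-D data with distinct values and takes exactly k neighbors per point (ties at the k-th distance are not handled), so it is a simplified illustration rather than a complete implementation. The data values are invented.

```python
def local_outlier_factors(values, k=2):
    """LOF score for each value in a 1-D list of distinct values."""
    n = len(values)
    # For every point: its k nearest neighbours as (distance, index) pairs.
    neighbours = []
    for i in range(n):
        d = sorted((abs(values[i] - values[j]), j) for j in range(n) if j != i)
        neighbours.append(d[:k])
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist) for dist, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for _, j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
            for i in range(n)]

values = [1.0, 1.1, 1.2, 1.3, 1.4, 5.0]
scores = local_outlier_factors(values)
print(scores)
```

Scores close to 1 indicate a point whose density matches that of its neighbors, while a score well above 1, as for the value 5.0 here, indicates an outlier.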

The main aim of the paper is to study and summarise the work on intrusion detection models. The applications of deep learning in intrusion detection systems are specifically explored as follows: Restricted Boltzmann Machines and their variants, including Deep Belief Networks (DBN) and Deep Boltzmann Machines (DBM); Convolutional Neural Networks (CNN); Recurrent Neural Networks (RNN); and Autoencoders (AE) and their variants. The advantages are: DL-based MHMS does not require extensive human labour and expert knowledge. Deep learning model implementations are not limited to particular types of devices. The drawbacks are: DL-based MHMS efficiency depends heavily on the size and consistency of datasets.

A major challenge for IDSs is the existing network traffic details, sometimes enormous in scale. Such big data slows down the entire detection process and, because of the computational difficulties in managing such data, may lead to unsatisfactory classification accuracy [1]. In IDS, machine learning technologies are typically used. Most conventional machine learning technologies, however, apply to shallow learning; they do not effectively solve the enormous problem of classification of intrusion data that occurs in the face of a real application environment for network applications. In addition, shallow learning with enormous data is incompatible with smart analysis and the predetermined criteria of high-dimensional learning.

Deep learning for network intrusion detection is one of the hot spots of recent academic study. The development of deep learning has been promoted by the enhancement of hardware computing power and the rapid growth of data volume, so that the practicality and popularity of deep learning have improved greatly [2]. Deep learning is a machine learning technique designed to allow computer systems to improve through experience and data. To learn representations of data for classification, deep learning uses several nonlinear feature transformations, i.e. processing layers built from multilayer perceptron mechanisms [3]. Deep learning has been applied to computer vision [4], speech recognition [5], natural language processing [6], biomedicine [7], and malicious code detection [8], as well as several other fields. Studies on deep learning in network security have steadily appeared since 2015, drawing broad interest from academic circles. In network security, deep learning is used mostly in two main areas, malware detection and network intrusion detection, where it increases detection performance and decreases false positives compared to conventional machine learning [9]. Deep learning algorithms also remove the reliance on feature engineering and are able to identify attack features automatically, helping to identify possible security threats [10].

Network intrusion detection is one of the essential means of protecting computer systems and networks. Deep learning for network intrusion detection is a hot topic of recent academic research, and several studies have suggested the efficient application of deep learning technology to network intrusion detection problems [11, 12]. At present, most experimental results for deep learning based network intrusion detection only differentiate between normal and attack traffic, and do not differentiate between attack types. The next focus is on several widely used deep learning models for multi-class network intrusion detection: deep neural networks, recurrent neural networks, and deep belief networks.

3. Related Work

The section presents various works carried out by some of the authors on NSL-KDD and CICIDS in the form of Table 1.

Table 1. Previous works related to CICIDS and NSLKDD datasets

Author	Year	Dataset	Feature Selection method used	Classification model used	Performance of the model
Hakim and Fatma [1]	2019	NSL-KDD	Information Gain, Gain Ratio, ReliefF selection, Chi-square	J48, Random Forest, Naïve Bayes, KNN	Performance is significant though there is a slight drop in accuracy
Patgiri et al. [2]	2018	NSL-KDD	Recursive Feature Elimination (RFE)	Random Forest, Support Vector Machine	SVM outperforms RF
Belavagi et al. [3]	2016	NSL-KDD	-	Random Forest, Support Vector Machine, Gaussian Naive Bayes, Logistic Regression	RF outperforms other methods
Pattawaro et al. [4]	2018	NSL-KDD	Attribute ratio	K-Means, XGBoost	Accuracy 84.41%, detection rate 86.36%, false alarm rate 18.20%
Aung et al. [5]	2018	KDD 99	-	k-means	-
Pervez et al. [6]	2014	NSL-KDD	Merge of feature selection and classification	SVM	91% to 99% accuracy
Mashayak et al. [7]	2019	NSL-KDD	Recursive Feature Elimination	Decision Tree, Random Forest	Accuracy 99%
Abdulhammed et al. [8]	2019	CICIDS 2017	Dimensionality Reduction using Auto Encoder, PCA	Random Forest, Bayesian network, LDA, QDA	-
Desale et al. [9]	2015	NSL-KDD	Genetic Algorithm	Naive Bayes and J48	-
Meira et al. [10]	2018	NSL-KDD, ISCX	-	Nearest Neighbors, K-means, Auto Encoder, Isolation Forest	Accuracy 60%

4. Methodology

4.1 Experiment steps for supervised learning

The experiment is carried out using the steps given below: “Data set selection, Data preprocessing, Feature Selection using Random Forest, Build the models using selected features, Train the models, Test the models, Compare the performance of the models”.

Data sets selection:

In this paper, the authors use the NSL-KDD and CICIDS-2017 datasets as benchmarks because the IDS research community has already adopted them. NSL-KDD is selected because it is the traditional benchmark, and CICIDS-2017 is selected because it covers up-to-date attack types. NSL-KDD is an improved version of KDD-CUP-99, where KDD stands for Knowledge Discovery in Databases. The CIC-IDS-2017 dataset was developed by the Canadian Institute for Cybersecurity.

NSLKDD [13] and CICIDS [14] are used for binary classification. The data proportions of the binary classes (normal and attack) show that NSLKDD is almost balanced and CICIDS is imbalanced.

Data Preprocessing:

Preprocessing is a crucial phase in which raw data can be transformed into a standardized format. It includes data cleaning (handling null or missing values, deleting unneeded variables, handling categorical values), data normalization or scaling, data balancing, separating target variables, and splitting data into train and test.

Feature Selection:

In data preprocessing, the number of features may increase if we apply one-hot encoding to categorical columns. Even otherwise, selecting a subset of the existing features plays a vital role because it affects the performance of the model.

Random Forest with feature importance is used for feature selection. Random Forest uses ensemble learning by combining a set of decision trees with controlled variance. Majority voting is used for deciding the predictions. As the number of trees increases, the model variance decreases, and Random Forests are resistant to overfitting. For these reasons, a Random Forest classifier is chosen for feature selection, with a feature importance threshold of 0.01.
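The thresholding step can be sketched independently of the model that produces the scores. The importance values below are hypothetical numbers invented for illustration, not results from the experiments, and keeping features whose score is greater than or equal to the threshold is an assumption of this sketch. In scikit-learn, such scores would typically be read from the `feature_importances_` attribute of a fitted random forest.

```python
def select_features(importances, threshold=0.01):
    """Keep the features whose importance is at least `threshold`."""
    return [name for name, score in importances.items() if score >= threshold]

# Hypothetical importance scores (illustrative only).
importances = {
    "src_bytes": 0.21,
    "dst_bytes": 0.12,
    "logged_in": 0.04,
    "num_outbound_cmds": 0.0,
    "urgent": 0.002,
}
selected = select_features(importances)
print(selected)
```

The same list of selected feature names is then applied to both the train and the test data, so that both keep identical columns.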

Build the models using selected features:

With the subset of features selected in the previous step, the following models are built. Logistic Regression, Random Forest, Decision Tree, Gaussian Naive Bayes, K- Nearest Neighbors.

Train the models:

Having the features selected for our dataset, the models can be trained using the train data.

Test the models: Here we use the test data to predict the labels in it and evaluate the performance metrics.

Compare the performance metrics of the models:

The performance metrics used to evaluate the models for prediction are the Confusion matrix, F1-Score, Precision, Recall, Area under ROC curve, and Accuracy.
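Most of these metrics follow directly from the confusion matrix. The sketch below assumes the layout [[TN, FP], [FN, TP]] (rows are actual classes 0 and 1, columns are predicted classes), and uses the Random Forest confusion matrix reported in Table 2 as input. The area under the ROC curve is not covered because it requires predicted scores rather than counts.

```python
def binary_metrics(cm):
    """Metrics from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # share of predicted attacks that are attacks
    recall = tp / (tp + fn)      # share of actual attacks that are detected
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Random Forest row of Table 2 (NSL-KDD).
metrics = binary_metrics([[7806, 5027], [270, 9441]])
for name, value in metrics.items():
    print(name, round(value, 6))
```

The computed values agree with the Random Forest row of Table 2 to within rounding in the last digit.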

4.1.1 Supervised learning using NSL-KDD dataset

This dataset has 41 feature columns and one label column. The 41 features are grouped into three categories: basic features related to TCP/IP connections, traffic features associated with the service or host, and content features extracted from packet contents. There are five different types of labels, which categorize the data as normal or as one of four types of attack: DOS, Probing, U2R, R2L.

DOS: To make the network resources unavailable to the user.

Probing: To explore vulnerabilities in the network that can lead to attacks.

U2R: An intruder who has user privileges tries to gain admin privileges.

R2L: An intruder gains illegitimate access to a remote system.

In this paper, binary classification of the data as normal or attack is used. The authors have used the KDDTrain+ and KDDTest+ datasets for implementation. KDDTrain+ has 125973 samples and KDDTest+ has 22544 samples.

Data Preprocessing:

Preprocessing includes the following steps.

1. In NSL-KDD dataset, there are no null values or missing values.

2. The column num_outbound_cmds contains zero for all the rows. It is deleted because it does not affect the performance.

3. There are three categorical features: protocol type, service, and flag. One-hot encoding is applied to the categorical features of both the train and test datasets. Protocol type has three unique values in both datasets, and flag has 11 unique values in both. Service has 70 unique values in the train dataset and 64 in the test dataset. The six service categories that are missing from the test dataset are filled with zeros.

4. The target label ‘class’ is encoded as 0 for normal data and 1 for attack data using Label Encoder.

5. All the one-hot encoded data is scaled to put them in the range between 0 and 1. Standard Scaler is used for this purpose.

6. For binary classification, the data is almost balanced, so no resampling techniques are used. The class balance is shown in Figure 1.

class 0: normal: 67343

class 1: anomaly: 58630

Proportion: 1.15:1

After completing the data preprocessing step, the shapes of the train and test data are:

Train shape: (125973, 121)

Test shape: (22544, 121)
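The way categories that are missing from the test data end up as zero-filled columns can be sketched as follows. The rows are toy records invented for illustration, and taking the category list from the train data is an assumption of this sketch.

```python
def one_hot(rows, column, categories):
    """Replace `column` in each row dict with one 0/1 column per category."""
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

# Toy records: the test data lacks the 'ecr_i' service seen in training.
train = [{"service": "http", "count": 2},
         {"service": "ecr_i", "count": 5},
         {"service": "private", "count": 1}]
test = [{"service": "http", "count": 3},
        {"service": "private", "count": 9}]

# Use the train categories for both sets so that the columns line up.
categories = sorted({row["service"] for row in train})
train_enc = one_hot(train, "service", categories)
test_enc = one_hot(test, "service", categories)
print(test_enc[0])
```

Because both sets are encoded against the same category list, the train and test data keep the same number of columns, and a category absent from the test data appears there as a column of zeros.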

Feature Selection:

The authors have chosen the Random Forest classifier for feature selection. Out of 121 features, 26 features are selected based on a feature importance threshold of 0.01. As a result, the dataset size is reduced to:

Train shape: (125973, 26)

Test shape: (22544, 26)

The selected features include:

[protocol_type_icmp, protocol_type_tcp, service_ecr_i, service_http, service_private, flag_S0, flag_SF, srv_serror_rate, same_srv_rate, diff_srv_rate, dst_host_count, dst_host_srv_count, srv_count, dst_host_rerror_rate, dst_host_srv_rerror_rate, dst_host_srv_diff_host_rate, dst_host_same_srv_rate, logged_in, dst_host_serror_rate, count, src_bytes, dst_bytes, dst_host_diff_srv_rate, dst_host_srv_serror_rate, dst_host_same_src_port_rate, serror_rate]

Build the models using selected features:

All the models ‘Logistic Regression, Random Forest, Decision Tree, Gaussian Naive Bayes, K- Nearest Neighbors’ are implemented using the subset of 26 features selected out of 121 features.

Train the models:

All the models are trained using the train data as

for cls in classifiers:
    trained_model = cls.fit(X_train, Y_train)

Test the models:

The models are tested with test data as

Y_pred = trained_model.predict(X_test)

Figure 1. Data balancing for NSL-KDD

Figure 2. ROC Curve for supervised learning with NSLKDD dataset

Table 2. Results of supervised learning with random forest feature selection using NSL-KDD

Model	Accuracy	F1 Score	Precision	Recall	AUC	Confusion matrix
Logistic Regression	0.722453	0.740513	0.619913	0.919369	0.853823	[[7359 5474] [783 8928]]
Decision Tree	0.754524	0.772488	0.642920	0.967459	0.780515	[[7615 5218] [316 9395]]
Random Forest	0.765037	0.780925	0.652543	0.972196	0.948926	[[7806 5027] [ 270 9441]]
Gaussian NB	0.743390	0.744738	0.651559	0.869014	0.819417	[[8320 4513] [1272 8439]]
K-Nearest Neighbors	0.764105	0.778545	0.653569	0.962619	0.809692	[[7878 4955] [363 9348]]

Compare the performance metrics of the models:

The models are tested with test data and the results are given in Table 2.

ROC curve for supervised learning using NSL-KDD:

The ROC curve for supervised learning is shown in Figure 2. The curve indicates that Random Forest has the largest area under the curve.

4.1.2 Supervised learning using CICIDS-2017 dataset

The dataset is available in two formats: PCAP files and CSV files. The authors have used the CSV files for implementing their models. All these files are combined to form 78 feature columns and one label column. There are 15 different labels: ‘BENIGN, DoS slowloris, DoS Slowhttptest, DoS Hulk, DoS GoldenEye, Heartbleed, PortScan, DDoS, FTP-Patator, SSH-Patator, Bot, Web Attack-Brute Force, Web Attack-XSS, Infiltration, Web Attack-Sql Injection’, that is, the benign class and 14 attack types. The authors have used binary classification to identify the traffic as normal or attack.

Data Preprocessing: Preprocessing includes the following steps.

1. The CICIDS dataset contains infinity values and null values. Infinity values are replaced with NaN values. All null values are then replaced with the mean of the column containing the null value.

2. Eight columns contain 0 for all the rows. The columns are:

[Bwd PSH Flags, Bwd URG Flags, Fwd Avg Bytes/Bulk, Fwd Avg Packets/Bulk, Fwd Avg Bulk Rate, Bwd Avg Bytes/Bulk, Bwd Avg Packets/Bulk, Bwd Avg Bulk Rate]

The above features are deleted as they do not affect the performance.

3. There are no categorical values in the dataset.

4. The target label ‘Label’ is encoded as zero for normal data and one for attack data using Label Encoder. Target labels are separated from the remaining features.

5. The data is scaled to put it in the range between 0 and 1. Standard Scaler is used for this purpose.

6. The data is identified as imbalanced for binary classification, as shown in Figure 3.

Figure 3. Data balancing for CICIDS dataset

Data shape: (2830743, 70)

class 0: Benign: 2273097

class 1: Anomaly: 557646

Proportion: 4.08: 1

7. The data is split into train data and test data. The test data size is 25% of the total data. After the data split, the size of the train and test data is:

Train_X shape: (2123057, 70)

Test_X shape: (707686, 70)

Train_y shape: (2123057,)

Test_y shape: (707686,)

8. The ‘Near Miss’ under-sampling technique is used for resampling the train data. With this technique, the train data is resampled to the average of the total samples rather than to the number of samples in the minority class, because under-sampling all the way down to the minority class size may cause underfitting.

Before UnderSampling, counts of label ‘1’: 418679

Before UnderSampling, counts of label ‘0’: 1704378

After UnderSampling, counts of label ‘1’: 418679

After UnderSampling, counts of label ‘0’: 675288

After UnderSampling, the shape of train_X: (1093967, 70)

After UnderSampling, the shape of train_y: (1093967,)
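Steps 1 and 5 above can be sketched on a single column. The values are invented for illustration. The sketch shows both min-max scaling, which maps values into the range 0 to 1, and standardization, which is what a standard scaler computes and which yields zero mean and unit variance rather than a fixed range; it does not assume which of the two was applied in the experiments. Both helpers assume a column that is not constant.

```python
import math

def impute_with_mean(column):
    """Treat +/-inf as missing (NaN), then replace NaN with the column mean."""
    cleaned = [math.nan if math.isinf(v) else v for v in column]
    finite = [v for v in cleaned if not math.isnan(v)]
    mean = sum(finite) / len(finite)
    return [mean if math.isnan(v) else v for v in cleaned]

def min_max_scale(column):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def standardize(column):
    """Rescale values to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

# Toy column containing one infinity and one missing value.
flow_rate = [10.0, math.inf, 30.0, math.nan, 20.0]
filled = impute_with_mean(flow_rate)
print(filled)
print(min_max_scale(filled))
print(standardize(filled))
```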

Feature selection:

The Random Forest classifier is used for feature selection. Out of 70 features, 27 features are selected based on a feature importance threshold of 0.01. As a result, the dataset size is reduced to:

Train_X shape: (1093967, 27)

Test_X shape: (707686, 27).

The selected features include:

[Destination Port, Total Fwd Packets, Total Backward Packets, Total Length of Fwd Packets, Fwd Packet Length Max, Fwd Packet Length Mean, Bwd Packet Length Max, Bwd Packet Length Min, Bwd Packet Length Mean, Bwd Packet Length Std, Flow Packets/s, Flow IAT Max, Fwd Packets/s, Max Packet Length, Packet Length Mean, Packet Length Std, Packet Length Variance, Average Packet Size, Avg Fwd Segment Size, Avg Bwd Segment Size, Subflow Fwd Packets, Subflow Fwd Bytes, Subflow Bwd Packets, Init Win bytes forward, Init Win bytes backward, act data pkt fwd, Idle Max].

Build the models using selected features:

All the models “Logistic Regression, Random Forest, Decision Tree, Gaussian Naive Bayes, K- Nearest Neighbors” are implemented using the subset of 27 features selected out of 70 features.

Train the models:

All the models are trained using the train data.

for cls in classifiers:
    trained_model = cls.fit(train_X, train_y)

Test the models:

The models are tested with test data as

Y_pred = trained_model.predict(test_X)

Compare the performance metrics of the models:

The models are tested with test data and the results are given in Table 3.

ROC curve for supervised learning using CICIDS data set:

The ROC curve is shown in Figure 4. The curve indicates that Random Forest has the largest area under the curve.

Hyper parameters used with the models in supervised learning:

Hyper parameters used in the supervised learning algorithms are given in Table 4.

Table 3. Results of supervised learning with random forest feature selection using CICIDS

Model	Accuracy	F1 Score	Precision	Recall	AUC	Confusion matrix
Logistic Regression	0.823021	0.592122	0.540815	0.654184	0.897242	[[491531 77188] [48057 90910]]
Decision Tree	0.891597	0.774368	0.654829	0.947296	0.910645	[[499328 69391] [7324 131643]]
Random Forest	0.937743	0.841484	0.841460	0.841509	0.986115	[[546686 22033] [22025 116942]]
Gaussian NB	0.696664	0.379280	0.317034	0.471939	0.766184	[[427436 141283] [73383 65584]]
K-Nearest Neighbors	0.906897	0.805871	0.682306	0.984089	0.950408	[[505043 63676] [2211 136756]]

Figure 4. ROC Curve for supervised learning with CICIDS

Table 4. Hyper parameters used in supervised learning

Model	Hyper parameters used
Logistic Regression	C = 1.0, penalty = ‘l2’, solver = ‘lbfgs’
Decision Tree	criterion = ‘gini’
Random Forest	n_estimators = 100
K-Nearest Neighbors	n_jobs = -1, algorithm = ‘auto’, metric = ‘minkowski’

4.2 Experiment steps for unsupervised learning

The steps used for the experiment are given below:

“Data set selection, Data preprocessing, Select the model for anomaly detection, Classification results”.

4.2.1 Unsupervised learning using NSL-KDD dataset

After data preprocessing (as with supervised learning), the unsupervised learning models K-means, Isolation Forest, and Local Outlier Factor are selected for the identification of clusters and anomaly detection. The results obtained are given in Table 5 and Table 6.

4.2.2 Unsupervised learning using CICIDS dataset

As part of data preprocessing, infinity values are replaced with NaN. All null values are replaced with the mean of their corresponding columns. The columns with all zero values are deleted. Data normalization is done to set the data values between 0 and 1. All target labels are encoded as 0 for normal and 1 for attack data, and are separated from the remaining independent variables. These independent features are fed to the models to learn the patterns and to prepare clusters. The number of clusters is taken as two. Predicted labels are compared with actual labels, and the results obtained are given in Table 7 and Table 8.

Hyper parameters used with the models in unsupervised learning:

The hyper parameters used in the unsupervised learning algorithms are given in Table 9.

Table 5. Results of unsupervised learning using NSL-KDD

Model	Clusters	Accuracy	Precision	Recall	F1 Score	Contingency matrix
K-Means	[0,1] 0 normal 1 anomaly	0.88	[0.99,0.82]	[0.76,0.99]	[0.86,0.89]	[[54185 17278] [757 76297]]
Isolation Forest	[-1,1] 1 normal -1 anomaly	0.56	[0.73,0.55]	[0.15,0.95]	[0.25,0.69]	[[10777 60686] [4075 72979]]
Local outlier factor	[-1,1] 1 normal -1 anomaly	0.49	[0.34,0.50]	[0.07,0.87]	[0.12,0.64]	[[5041 66422] [9811 67243]]

Table 6. Results of unsupervised learning using NSL-KDD

Model	Adjusted Rand score	Adjusted mutual info score	Homogeneity score	Completeness score	V-measure	Fowlkes-Mallows score
K-Means	0.5732	0.5389	0.52588	0.55262	0.53892	0.79415
Isolation Forest	0.0154	0.0268	0.0197	0.04202	0.0268	0.64678
Local outlier factor	-0.00020	0.00895	0.00658	0.01402	0.0089	0.64068

Table 7. Results of unsupervised learning using CICIDS

Model	Clusters	Accuracy	Precision	Recall	F1 Score	Contingency matrix
K-Means	[0,1] 0-normal 1-anomaly	0.79	[0.84,0.46]	[0.91,0.31]	[0.88,0.37]	[[2078680 194417] [389423 168223]]
Isolation Forest	[-1,1] 1-normal -1-anomaly	0.79	[0.45,0.83]	[0.23,0.93]	[0.30,0.88]	[[126033 431613] [157042 2116055]]
Local Outlier factor	[-1,1] 1-normal -1-anomaly	0.56	[0.55,0.73]	[0.07,0.95]	[0.24,0.68]	[[10477 60486] [4099 72999]]

Table 8. Results of unsupervised learning using CICIDS

Model	Adjusted Rand score	Adjusted mutual info score	Homogeneity score	Completeness score	V-measure	Fowlkes-Mallows score
K-Means	0.1781	0.0628	0.0556	0.07216	0.06285	0.77735
Isolation Forest	0.1387	0.0439	0.03634	0.0554	0.04391	0.78415
Local Outlier factor	0.0147	0.02468	0.0187	0.04102	0.02652	0.6366

Table 9. Hyper parameters used with the models in unsupervised learning

Model	Hyper parameters used
K-Means	init = ‘k-means++’, n_clusters = 2
Isolation Forest	n_estimators=100, contamination=0.1
Local Outlier Factor	contamination='auto', n_jobs= -1

5. Results and Discussion

In supervised learning with the NSL-KDD dataset, Random Forest and K-NN show better performance than the other models, with an accuracy of 76%. For all the models, recall is higher than precision, which means that there are fewer false negatives than false positives. From a network security perspective, a low false-negative rate is required. With the CICIDS dataset, Random Forest outperforms the other models with an accuracy of 93%. Its precision and recall values are almost the same, and it has the largest area in the ROC curve plot. After Random Forest, the KNN and Decision Tree algorithms show the best performance. The metrics accuracy, precision, recall, F1 score, confusion matrix, and classification report are evaluated and presented in the tables.

In unsupervised learning, with both the NSL-KDD and CICIDS datasets, K-means shows better accuracy. However, its result depends on the random seed. The best accuracy observed is 88% with NSL-KDD and 79% with CICIDS. For the Isolation Forest and Local Outlier Factor algorithms, which mark outliers with the value -1, a new column is added to both datasets in which the actual labels [0, 1] are changed to [1, -1]; the outlier labels are then compared with these labels and all the metrics are evaluated. V-measure is the harmonic mean of the homogeneity and completeness scores. The Fowlkes-Mallows score is the geometric mean of the pairwise precision and recall. The adjusted Rand score, adjusted mutual info score, homogeneity score, completeness score, V-measure, and Fowlkes-Mallows score are used for internal evaluation based on the data [15]. The other metrics, accuracy, precision, recall, and F1 score, are used for external evaluation to quantify the quality of predictions.
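The two definitions above can be checked against the K-Means results for NSL-KDD. The sketch below takes the homogeneity and completeness scores from Table 6 and the contingency matrix from Table 5; the pair-counting formula for the Fowlkes-Mallows score is the standard one and is an addition of this sketch, not a description of the code used in the experiments.

```python
import math

def v_measure(homogeneity, completeness):
    """Harmonic mean of the homogeneity and completeness scores."""
    return 2 * homogeneity * completeness / (homogeneity + completeness)

def fowlkes_mallows(contingency):
    """Fowlkes-Mallows score from a contingency matrix (classes x clusters)."""
    n = sum(sum(row) for row in contingency)
    # Pair counts, each expressed as twice the number of sample pairs.
    tk = sum(v * v for row in contingency for v in row) - n   # same class and cluster
    pk = sum(sum(col) ** 2 for col in zip(*contingency)) - n  # same cluster
    qk = sum(sum(row) ** 2 for row in contingency) - n        # same class
    # Geometric mean of pairwise precision (tk/pk) and recall (tk/qk).
    return tk / math.sqrt(pk * qk)

v = v_measure(0.52588, 0.55262)
fmi = fowlkes_mallows([[54185, 17278], [757, 76297]])
print(round(v, 3), round(fmi, 3))
```

Both values come out close to the V-measure and Fowlkes-Mallows score reported for K-Means in Table 6.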

6. Conclusion

This paper presents a comparative study of supervised and unsupervised algorithms using the NSL-KDD and CICIDS datasets. For supervised learning, a random forest is used for feature selection, with a feature importance threshold of 0.01 applied to both the training and testing data. Using the selected features, the models are evaluated on both datasets. Since the CICIDS data is imbalanced, Near Miss under-sampling is used for balancing it; the models are then trained on the under-sampled data with the selected features, and their predictions are evaluated and quantified. Unsupervised learning models are used for clustering and anomaly detection. With supervised learning, Random Forest and KNN perform better than the other algorithms. With unsupervised learning, K-Means performs better.

References

[1] Hakim, L., Fatma, R. (2019). Influence analysis of feature selection to network intrusion detection system performance using NSL-KDD dataset. In 2019 International Conference on Computer Science, Information Technology, and Electrical Engineering (ICOMITEE), pp. 217-220. https://doi.org/10.1109/icomitee.2019.8920961

[2] Patgiri, R., Varshney, U., Akutota, T., Kunde, R. (2018). An investigation on intrusion detection system using machine learning. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1684-1691. https://doi.org/10.1109/ssci.2018.8628676

[3] Belavagi, M.C., Muniyal, B. (2016). Performance evaluation of supervised machine learning algorithms for intrusion detection. Procedia Computer Science, 89: 117-123. https://doi.org/10.1016/j.procs.2016.06.016

[4] Pattawaro, A., Polprasert, C. (2018). Anomaly-Based Network intrusion detection system through feature selection and hybrid machine learning technique. In 2018 16th International Conference on ICT and Knowledge Engineering (ICT&KE), pp. 1-6. https://doi.org/10.1109/ictke.2018.8612331

[5] Aung, Y.Y., Min, M.M. (2018). An analysis of K-means algorithm based network intrusion detection system. Advances in Science, Technology and Engineering Systems Journal, 3(1): 496-501. https://doi.org/10.25046/aj030160

[6] Pervez, M.S., Farid, D.M. (2014). Feature selection and intrusion classification in NSL-KDD cup 99 dataset employing SVMs. In The 8th International Conference on Software, Knowledge, Information Management and Applications (SKIMA 2014), pp. 1-6. https://doi.org/10.1109/skima.2014.7083539

[7] Mashayak, S.A., Bombade, B.R. (2019). Network intrusion detection exploitation machine learning strategies with the utilization of feature elimination mechanism. International Journal of Computer Sciences and Engineering, 7(5): 1292-1300. https://doi.org/10.26438/ijcse/v7i5.12921300

[8] Abdulhammed, R., Musafer, H., Alessa, A., Faezipour, M., Abuzneid, A. (2019). Features dimensionality reduction approaches for machine learning based network intrusion detection. Electronics, 8(3): 322. https://doi.org/10.3390/electronics8030322

[9] Desale, K.S., Ade, R. (2015). Genetic algorithm based feature selection approach for effective intrusion detection system. In 2015 International Conference on Computer Communication and Informatics (ICCCI), pp. 1-6. https://doi.org/10.1109/iccci.2015.7218109

[10] Meira, J., Andrade, R., Praça, I., Carneiro, J., Marreiros, G. (2018). Comparative results with unsupervised techniques in cyber attack novelty detection. In International Symposium on Ambient Intelligence, pp. 103-112. https://doi.org/10.3390/proceedings2181191

[11] Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A. (2018). Toward generating a new intrusion detection dataset and intrusion traffic characterization. In ICISSP, pp. 108-116. https://doi.org/10.5220/0006639801080116

[12] Aksu, D., Üstebay, S., Aydin, M.A., Atmaca, T. (2018). Intrusion detection with comparative analysis of supervised learning techniques and fisher score feature selection algorithm. In International Symposium on Computer and Information Sciences, pp. 141-149. https://doi.org/10.1007/978-3-030-00840-6_16

[13] NSL-KDD Data Set [Online], Available at: https://www.unb.ca/cic/datasets/nsl.html/, accessed on 6 June 2020.

[14] CICIDS 2017 Data Set [Online]. Available: https://www.unb.ca/cic/datasets/ids2017.html, accessed on 6 June 2020.

[15] Clustering metrics accessed from https://scikit-learn.org/stable/modules/clustering.html, accessed on 6 June 2020.
