Classification of Salt Quality Based on the Content of Several Elements in the Salt Using Machine Learning

Salt is one of the commodities in Indonesia. Salt has a very strategic and sustainable role for human life. Apart from being used for daily consumption, salt is also used as a raw material for various industries. Indonesia, as a country surrounded by coastlines, can be self-sufficient in salt production and meet domestic salt needs. However, not all the salt produced is of sufficient quality for consumption. Therefore, the quality of the produced salt must be monitored and categorized. Because the categorization of salt quality is still carried out manually, this research employs data mining techniques with three different algorithms, Naïve Bayes, K-Nearest Neighbor (K-NN), and Support Vector Machine (SVM), to simplify the classification process and make it more efficient. The dataset was obtained from salt data in the Sumenep region of Madura and consists of 349 records with seven attributes (sulfate, magnesium, water content, calcium, not dissolved, NaCl(wb), and NaCl(db)) and four classes that represent grades of salt quality (K1, K2, K3, and K4). The data is divided into training and testing sets using the k-fold cross-validation method. Test results indicate that the K-NN method provides better outcomes than the other methods, with an AUC of 99.0%, accuracy of 91.7%, F1 Score of 91.6%, precision of 91.9%, and recall of 91.7%.


INTRODUCTION
As one of the largest commodities in Indonesia, salt plays a crucial and essential role for humans, both in the form of table salt for consumption and as a raw material in various industries [1][2][3]. In Indonesia, salt can be categorized into two main types: consumption salt and industrial salt [4,5]. Consumption salt is used in food and plays a vital role in maintaining electrolyte balance in the human body. Industrial salt, on the other hand, is utilized across various sectors, including the chemical, pharmaceutical, and textile industries. Therefore, the quality of salt is of utmost importance, both for human health and for maintaining the quality of industrial products [6].
The challenge of salt quality in Indonesia remains an open issue, even though Indonesia has significant salt production potential. Not all salt produced meets the necessary standards for consumption or industrial use. Factors such as contamination, mineral content, and concentrations of active compounds play a role in determining salt quality [7]. Therefore, monitoring and testing the quality of salt are essential tasks. The mineral halite, or rock salt, is sometimes called common salt (NaCl) to distinguish it from the class of chemical compounds called salts. In contrast to the world salt classification, the national salt classification broadly groups salt into two types, namely consumption salt and industrial salt.
In enhancing the monitoring and control of salt quality, technologies like data mining and statistical analysis are highly valuable [8]. These methods assist in categorizing salt based on its quality, which in turn helps identify salt that meets standards and can be used safely. Research that uses data mining techniques to classify salt quality therefore represents a progressive step in optimizing salt production and usage in Indonesia. In a 2019 study on classification using the C4.5 method, the proposed method was not yet able to classify optimally because it could not handle datasets with a large number of classes [9]. The variety of machine learning methods available for data mining has led several researchers to apply several methods to one case study and compare their performance. One such study in 2019 compared two methods, Naïve Bayes and C4.5, and found that Naïve Bayes was superior to C4.5 in classifying a dataset. Meanwhile, in research comparing three methods for classifying numerical data, the SVM and KNN methods were more accurate than the Decision Tree method, with the SVM algorithm achieving the best prediction accuracy of 95% [10,11].
Based on previous research, the researchers were motivated to examine the implementation of several different algorithms to construct a classification system for salt quality. The data used in this study consists of salt data obtained from the Sumenep Regency, which is the largest salt-producing region on Madura Island [12].
Overall, understanding the importance of salt and its quality in Indonesia plays a crucial role in maintaining human health and supporting various industrial sectors. With research and development efforts as described above, it is hoped that salt management in Indonesia will be further improved, both in terms of sustainable production and in terms of meeting higher quality standards. Therefore, this research proposes the KNN classification method to obtain the most optimal classification model for categorizing salt quality, and compares it with two other classification methods, Support Vector Machine and Naïve Bayes.

MATERIALS AND METHODS
This study discusses the classification of salt quality using numerical data. Classification is one of the disciplines in data mining, which involves extracting, identifying, and analyzing information with the aim of discovering patterns within data using mathematical, statistical, artificial intelligence, and machine learning approaches [13,14]. Data mining is also known as knowledge discovery in databases. There are several types of data mining based on function, including description, prediction, estimation, classification, clustering, and association [15]. Figure 1 shows the steps in data mining.

Figure 1. Data mining
Here is an explanation of the steps in the data mining process:
(1) Data Cleaning: Involves removing irrelevant and inconsistent data.
(2) Data Integration: Involves merging and adding data that is relevant.
(3) Data Selection: Involves choosing data to be used as the basis for analysis.
(4) Data Transformation: Before the mining process, the data is converted into a certain format.
(5) Data Mining: Involves searching for information using data mining methods.
(6) Evaluation: Involves identifying patterns from data obtained by the method and then evaluating the results and the hypothesis.

Data
The quality of salt is categorized into several classes based on its intended use. In all cases, it is important to understand the purpose of using salt and ensure that the salt used meets the required standards to maintain human health and product quality. Several features are used in this study to categorize salt into classes K1, K2, K3, and K4, as explained below:
(1) Water Content: Water content is used because the moisture content in salt significantly impacts its quality. If the moisture content is too high, the salt's durability decreases [16].
(2) Not Dissolved: Salt is a sodium compound that is almost entirely soluble in water. The insoluble fraction is therefore used as a feature for classifying salt quality, as higher-quality salt tends to have a higher solubility level. Salt that dissolves readily is highly recommended for consumption as it is beneficial for health [17].
(3) Calcium: Calcium is one of the elements present in salt, so its concentration also affects the quality of salt [18]. Salt that is suitable for consumption should have a maximum calcium content of 0.06% [17].
(4) Magnesium: Magnesium is also one of the elements present in salt, so its concentration affects the quality of salt [18]. Salt that is suitable for consumption should have a maximum magnesium content of 0.06% [17].
(5) Sulfate: Sulfate is a compound that can decrease the NaCl content in salt, whereas good quality salt should have a minimum NaCl content of 97%. Therefore, a high sulfate content lowers the quality classification of the salt [17].
(6) NaCl (wb) and (db): NaCl on a wet basis and on a dry basis are both used as features to categorize salt data into quality classes. Salt that meets the food grade standard must have a minimum NaCl content of 97% [17].

Data pre-processing
Data preprocessing is the initial step before entering the model training phase, aimed at organizing the input dataset into a structured format to facilitate the training process [19]. In this study, the data preprocessing stage involves data transformation to adapt the data format according to the requirements.

Data transformation
Data transformation involves rescaling the data into a different format to achieve the desired data distribution. The same mathematical operation is applied to every data point [20]. One method of data transformation is data normalization, which adjusts several variables to a uniform range of values so that overly large or small values do not affect the analysis outcomes. The transformation is applied to all data so that the relative differences between data points are maintained. If multiple variables are present, the transformation is applied to all variables to preserve the relationships between data points [21]. Data normalization is implemented using the Min-Max normalization technique, defined by Eq. (1):
$$z = \frac{x - \min(x)}{\max(x) - \min(x)} \quad (1)$$
where z: normalization result, x: original value, min(x): minimum value of x, max(x): maximum value of x.
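As an illustration of Min-Max normalization, the following minimal Python sketch rescales one attribute to the range 0 to 1. The function name `min_max_normalize` and the sample readings are hypothetical and are not taken from the study's dataset; the handling of a constant column is an assumption of this sketch.

```python
def min_max_normalize(values):
    """Scale a list of numbers to the range [0, 1] with Min-Max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula would divide by zero,
        # so this sketch maps everything to 0.0 (an assumption).
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical water-content readings (illustrative, not from the dataset)
water_content = [2.0, 4.0, 6.0, 10.0]
normalized = min_max_normalize(water_content)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```

In a multi-attribute dataset, the same function would be applied to each attribute column separately, so that every attribute ends up on the same 0 to 1 scale.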

Data mining
Within this research, the data mining procedure includes dividing the data into training and testing datasets. The training dataset is used to train the model, whereas the testing dataset is employed for validation and assessment to gauge the effectiveness of the techniques used. The dataset is separated using the K-Fold Cross Validation approach, where the parameter "k" determines the number of dataset divisions. For instance, with 5-fold validation the dataset is divided into five subsets (folds) of nearly equal size; in each round, four subsets are used for training and one for testing [22]. Subsequently, the classification process employs three different algorithms to construct a classification model capable of categorizing the quality of salt. These algorithms are K-Nearest Neighbor (KNN), Support Vector Machine and Naïve Bayes.
The advantage of the KNN method is that, apart from being easy to implement and adapt, it has few hyperparameters. SVM has only two free parameters, the upper bound and the kernel parameter. SVM also produces unique and optimal solutions and implements the principle of structural risk minimization (SRM), which is known to give good generalization performance. Meanwhile, Naïve Bayes has the advantage of being simple and fast while achieving high accuracy.
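The k-fold splitting described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the helper name `k_fold_indices` is hypothetical, and the records are split in their original order, whereas in practice the data is usually shuffled or stratified first (the text does not state which was done).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 349 records, as in the salt dataset, split into 5 folds
folds = list(k_fold_indices(349, 5))
print([len(test) for _, test in folds])  # [70, 70, 70, 70, 69]
```

Each record appears in the test set exactly once across the k rounds, which is why the fold sizes sum to the full 349 records.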

K-Nearest Neighbor (KNN)
K-Nearest Neighbor is an example of instance-based learning and is often used for classification tasks, where the objective is to classify unseen data based on the stored data. A new data point is classified based on its similarity to the data points stored in the model, using a similarity metric. The algorithm determines the class of the new point by selecting the K nearest points, also known as the K neighbors, and choosing the most common class among them by majority vote [23]. The steps required to implement this method are as follows:
(1) Determine the value of k, the number of nearest neighbors, at the beginning.
(2) Calculate the distance between the object point and the training data. In this study, the distance is calculated using the Euclidean distance method. Euclidean distance measures the distance between two points; the closer and more similar they are, the smaller the distance between them. It is formulated in Eq. (2):
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad (2)$$
where d(x, y): Euclidean distance, x: data 1, y: data 2, i: attribute index, n: number of attributes.
(3) Sort the results from the second step from the smallest distance to the largest.
(4) Collect the categories of the neighboring data based on the value of k.
(5) Determine the majority category of the nearest neighbors and use it to predict the class of the new data object.
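The steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the function names `euclidean` and `knn_predict` and the two-feature records are hypothetical, the records are assumed to be already normalized, and ties in the vote are broken by whichever label was encountered first.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, query, k=5):
    """Predict the class of `query` by majority vote among its k nearest neighbors."""
    # Distance from the query to every training record
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Sort ascending so the nearest neighbors come first
    distances.sort(key=lambda pair: pair[0])
    # Collect the labels of the k nearest neighbors and take the majority
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already-normalized two-feature records (not from the paper's dataset)
train_X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [0.8, 0.9], [0.9, 0.8], [0.85, 0.85]]
train_y = ["K1", "K1", "K1", "K3", "K3", "K3"]
print(knn_predict(train_X, train_y, [0.2, 0.2], k=3))  # K1
```

Because every prediction scans the full training set, the cost grows with the number of stored records, which is the "lazy" behaviour of the method.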
The K-NN method is often used for classifying data due to its simple and straightforward implementation, quick training process, and its applicability to data with noise. However, this method also has its drawbacks. It falls under the category of lazy algorithms, which can lead to slightly longer program execution times. It is highly sensitive to irrelevant features and requires memory for storing the training data records used in the process.

Naïve Bayes
Naïve Bayes is a simple probabilistic classification method that calculates a set of probabilities by counting occurrences and combinations of values in the provided dataset. This algorithm employs Bayes' theorem and assumes that all attributes are independent of one another given the value of the class variable. An alternative description states that Naïve Bayes is a classification method that utilizes probability techniques and statistical insights developed by the British scientist Thomas Bayes. It predicts future prospects using previous experiences as a reference [10,24].
This algorithm is a probability technique used to categorize classes in a given dataset. The basic outline of the Naïve Bayes method involves statistical analysis in which initial probabilities (prior probabilities) are estimated from training data. The probabilities for each parameter are then calculated based on these initial probabilities. Its main characteristic is the strong assumption of independence between conditions. When there are two distinct events, denoted as Y and Z, Bayes' theorem can be expressed through Eq. (3) [24]:
$$P(Y \mid Z) = \frac{P(Z \mid Y)\, P(Y)}{P(Z)} \quad (3)$$
where Z: new data record, Y: hypothesis, P(Y|Z): probability of hypothesis Y given Z, P(Y): probability of hypothesis Y, P(Z|Y): probability of Z given the condition Y, P(Z): probability of Z.
This method is relatively straightforward as it does not involve matrix multiplication or numerical optimization. It is efficient when used for predicting large amounts of data and provides a relatively high level of accuracy in its predictions. However, it cannot be applied directly to cases that involve conditional probabilities with a value of zero, as the predicted probabilities will also be zero.
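To make Bayes' theorem concrete, the following minimal Python sketch computes the posterior probability of each class from given priors and likelihoods. The function name `posterior` and all probability values are hypothetical and chosen only for illustration; in a Naïve Bayes classifier the likelihood of a record would itself be the product of the per-attribute likelihoods, under the independence assumption described above.

```python
def posterior(priors, likelihoods):
    """Return P(Y|Z) for each class Y, given P(Y) and P(Z|Y)."""
    # Numerator of Bayes' theorem for each class: P(Z|Y) * P(Y)
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    # P(Z) is the sum of the joint probabilities over all classes
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}


# Hypothetical probabilities for two grades (illustrative values only)
priors = {"K1": 0.5, "K2": 0.5}
likelihoods = {"K1": 0.75, "K2": 0.25}
print(posterior(priors, likelihoods))  # {'K1': 0.75, 'K2': 0.25}

# A zero likelihood forces the posterior of that class to zero
print(posterior(priors, {"K1": 0.75, "K2": 0.0}))  # {'K1': 1.0, 'K2': 0.0}
```

The second call shows the zero-probability limitation mentioned above: once one conditional probability is zero, no other evidence can raise that class's posterior.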

Support vector machine
The Support Vector Machine (SVM) method, introduced by Boser, Guyon, and Vapnik in 1992 at the Annual Workshop on Computational Learning Theory, is a machine learning technique applicable to both classification and prediction. SVM's core concept in classification involves identifying an optimal separator, known as a hyperplane. The hyperplane is deemed optimal when it offers the largest margin, representing twice the distance between the hyperplane and the support vectors. These vectors are the data points nearest to the hyperplane. SVM excels in managing high-dimensional data and limited training samples due to its adherence to the Structural Risk Minimization (SRM) principle, which maximizes margin and minimizes expected risk in the face of uncertainty [25].

Figure 2. Support vector machine
Figure 2 illustrates two classes, each marked with a distinct pattern. In Figure 2, the two classes are separated by a red line known as the hyperplane. In this algorithm, the hyperplane is adjusted until it is ideal for separating the two distinct classes. The more optimal the resulting hyperplane is, the lower the error rate of the classification system.
In addition to handling linear data, SVM is also capable of addressing problems with data that cannot be linearly separated, also known as non-linear data. Non-linear problems can be overcome by using kernels to work in a higher-dimensional space [26,27]. There are several kernel variations within the SVM method, including:
(1) Linear kernel: The linear kernel function is used for linear data classification. It is the simplest kernel function and is used when the data being analysed is linearly separable.
$$K(x_i, x) = x_i^{T} x \quad (4)$$
(2) Polynomial kernel: The polynomial kernel function is used when the data is not linearly separable. Polynomial kernels are a more general form of linear kernels. In machine learning, a polynomial kernel is a kernel function suitable for use in SVMs and other kernelized models, where the kernel represents the similarity of training sample vectors in feature space. Polynomial kernels are also suitable for solving classification problems on normalized training datasets.
$$K(x_i, x) = (\gamma x_i^{T} x + r)^{d}, \quad \gamma > 0 \quad (5)$$
(3) Radial Basis Function kernel: The Radial Basis Function (RBF) kernel function is used for non-linear data classification. The RBF kernel, also called the Gaussian kernel, is the kernel most widely used to solve classification problems on data that cannot be separated linearly. In its standard form, $K(x_i, x) = \exp(-\gamma \lVert x_i - x \rVert^{2})$ with $\gamma > 0$. This kernel is known to perform well with suitable parameters, and its training results have a small error value compared to other kernels.
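The three kernels can be written as short Python functions, which may help to compare them side by side. This is a sketch in their standard textbook forms; the parameter names `gamma`, `r` and `d` and their default values are assumptions of this example and are not the settings used in the study.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_kernel(xi, x):
    """Linear kernel: the dot product of xi and x."""
    return dot(xi, x)


def polynomial_kernel(xi, x, gamma=1.0, r=1.0, d=2):
    """Polynomial kernel: (gamma * xi.x + r) ** d, with gamma > 0."""
    return (gamma * dot(xi, x) + r) ** d


def rbf_kernel(xi, x, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * ||xi - x||^2), with gamma > 0."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(xi, x))
    return math.exp(-gamma * sq_dist)


a, b = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(a, b))      # 11.0
print(polynomial_kernel(a, b))  # 144.0
print(rbf_kernel(a, a))         # 1.0
```

The RBF value is 1.0 for identical points and decays towards 0 as the points move apart, which is why it behaves as a similarity measure.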
The steps in carrying out classification using the SVM method are as follows:
(1) The initial stage computes the Hessian matrix, which results from multiplying the kernel function by the values of y. The value of y corresponds to the class label, namely 1 or -1. The Hessian matrix is calculated using Eq. (8):
$$D_{ij} = y_i y_j \left(K(x_i, x_j) + \lambda^{2}\right) \quad (8)$$
where D_ij: the component of the Hessian matrix (i is the row and j is the column), λ: the theoretical limit that will be derived, y_i: the class of data i, y_j: the class of data j.
(2) The second phase evaluates the error value using Eq. (9), computes delta alpha using Eq. (10), and determines the updated alpha using Eq. (11):
$$E_i = \sum_{j} \alpha_j D_{ij} \quad (9)$$
$$\delta\alpha_i = \min\{\max[\gamma(1 - E_i), -\alpha_i],\; C - \alpha_i\} \quad (10)$$
$$\alpha_i = \alpha_i + \delta\alpha_i \quad (11)$$
where E_i: error score, α_i: alpha, δα_i: delta alpha, γ: learning rate, C: constant.
(3) The third step finds the bias value b.
(4) The fourth step calculates the dot product on the training and testing data.
(5) The final step determines the class of the test data from the sign of the decision function, $f(x) = \operatorname{sign}\left(\sum_{i} \alpha_i y_i K(x_i, x) + b\right)$.
The SVM method has often been used for classifying data in previous studies, mainly due to its excellent accuracy and relatively easy training process. This method also adapts well to high-dimensional datasets. The balance between model complexity and error can be managed easily, and it can handle both continuous and categorical data. However, a notable drawback of this method is its difficulty of interpretation unless the features are easily understandable. Additionally, there is a lack of result transparency due to its non-parametric methodology.

Evaluation
To measure the accuracy level and performance outcomes of the algorithm, various methods can be employed. In this research, the Confusion Matrix method was used for evaluation. The Confusion Matrix is an evaluation method which calculates the precision, accuracy, recall and F-measure of algorithm predictions based on test data [28]. Table 1 is the Confusion Matrix used as a reference for calculating accuracy, precision, recall, F1 Score and AUC values.

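Table 1. Confusion matrix
                      Actual Positive   Actual Negative
Predicted Positive    TP                FP
Predicted Negative    FN                TN

Based on the confusion matrix in Table 1, the values of accuracy, precision, recall, F1 Score, and AUC can be calculated as follows:
(1) Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
(2) Precision: $\text{Precision} = \frac{TP}{TP + FP}$
(3) Recall: $\text{Recall} = \frac{TP}{TP + FN}$
(4) F1 Score: $F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
(5) AUC: the area under the ROC curve, which plots the true positive rate against the false positive rate.
where TP: total number of true positive predictions, TN: total number of true negative predictions, FN: total number of data instances with actual positive class but predicted as negative, FP: total number of data instances with actual negative class but predicted as positive.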
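As a worked example of the confusion-matrix metrics, the following Python sketch computes accuracy, precision, recall and F1 Score from the four counts. The function name `binary_metrics` and the counts are hypothetical and are not results from this study. For a four-class problem such as K1 to K4, these metrics would typically be computed per class and then averaged; the averaging scheme is an assumption, as the text does not state it.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration (not results from this study)
m = binary_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

This sketch does not guard against division by zero, which occurs when a class is never predicted (TP + FP = 0) or never present (TP + FN = 0); production code would need to handle those cases.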

Data gathering
Salt quality data was taken from salt pond water in the Sumenep district, consisting of 349 records and 7 attributes: sulfate, magnesium, water content, calcium, not dissolved, NaCl(wb), and NaCl(db). The class in the data represents the grade of salt quality, categorized into 4 classes, K1 to K4. Table 2 shows the dataset used in this study.

System architecture
In this section, we elaborate on how the research is conducted using the previously described methods, as an effort to address the raised issues. The steps in the classification process are represented through the IPO diagram presented below.

Figure 3. IPO diagram
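Based on Figure 3, the classification is divided into three stages, namely input, process, and output. The components of these stages are explained as follows:
(1) Data Input Process: This process marks the initial stage of data mining, in which the dataset to be classified is input. The dataset consists of 7 salt-related features.
(2) Preprocessing: Data transformation is carried out in this process by normalizing the data with a min-max scaler so that all values fall within a similar range, from 0 to 1.
(3) Splitting Process: The data is divided into training and testing data using k-fold cross validation with k values of 5, 10 and 20. The training data, a certain portion of the overall data, is used to build the classification model. The testing data remains unused during the training phase and is used to validate the model's performance.
(4) Classification: In this process, learning is carried out to obtain the best classification model using machine learning methods, namely KNN, Support Vector Machine and Naïve Bayes. The training results provide knowledge about how the model classifies data. Testing on both training data and test data was carried out to determine the quality level of the salt. The different classification models are compared to find out which method classifies most appropriately and accurately.
(5) Output: The output of this classification process categorizes the salt data into classes based on the modelling that has been carried out. The salt quality level resulting from each algorithm is analysed and evaluated using the Confusion Matrix.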

K-Nearest Neighbor (K-NN)
At this stage the salt dataset is classified using the KNN method with k=5. Three trial scenarios are used in this classification process, with 5, 10 and 20 folds. The results obtained from applying this method can be seen in Table 3. Table 3 shows that the AUC value is highest with 10 and 20 folds, at 99.0%. Overall, the best evaluation results were obtained with 10 folds, with an accuracy of 91.7%, an F1 Score of 91.6%, precision of 91.9%, and recall of 91.7%.

Support vector machine
By applying the Support Vector Machine method to classify salt quality, good results were obtained. These evaluation results were achieved using the parameters indicated in Table 4 and with the dataset split using k-fold values of 5, 10, and 20. Of the three tests with varying k-fold values, the best performance was achieved with 20 folds. While the highest AUC value was observed with 10 folds, the classification accuracy, F1 Score, precision, and recall were better with 20 folds, as shown in Table 5. Specifically, this configuration resulted in an AUC value of 87.7%, an accuracy of 71.7%, an F1 Score of 71.2%, a precision of 72.6%, and a recall of 71.7%.

Naïve Bayes
The results obtained in this study using the salt dataset and the Naïve Bayes algorithm involved dividing the dataset with k-fold values of 5, 10, and 20.
Among the three tests with different k-fold values, the best performance was achieved with 10 folds, as indicated in Table 6. This resulted in an AUC value of 78.55%, an accuracy of 55.7%, an F1 Score of 56.0%, a precision of 56.9%, and a recall of 55.7%.
The tests that have been carried out compare the three methods to determine which is more accurate. The comparison of the performance of each algorithm model can be seen in Table 7. It can be seen that the KNN algorithm is superior to the SVM and Naïve Bayes algorithms for multivariate data types with an accuracy value of 96%. The K-NN method performs very well in prediction when used on the multivariate data type in this study. This shows that the classification and prediction algorithm models are improving as well. Based on the accuracy, precision, recall and F1-Score values, it can be concluded that the KNN algorithm performs better than the SVM and Naïve Bayes algorithms in classifying salt quality.

CONCLUSIONS
System identification allows all classification classes, and feature engineering plays an important role in increasing the accuracy of classification models. The opportunities and challenges of machine learning have been explored in this research. This shows that salt data is a source of data for classification. However, in order for this potential to be realized, it is necessary to develop methods that are effective and able to handle these conditions.
Based on the analysis and discussion conducted on the training and testing data using different k-fold values on a dataset containing 349 records and 7 attributes (sulfate, magnesium, water content, calcium, not dissolved, NaCl(wb), and NaCl(db)), the KNN algorithm performs better than the SVM and Naïve Bayes algorithms in classifying salt quality.

Table 2. Dataset
Data | Water content | Not dissolved | Calcium | Magnesium | Sulfate | NaCl (wb) | NaCl (db) | Grade

Table 3. Evaluation results of K-NN

Table 5. Evaluation results of SVM

Table 6. Evaluation results of Naïve Bayes

Table 7. Comparison of evaluation results (AUC values) for the KNN, SVM and Naïve Bayes methods with 10 folds