Classification of Surface Defects in Steel Sheets Using Developed NasNet-Mobile CNN and Few Samples

ABSTRACT


INTRODUCTION
Hot-rolled strip steel is widely used in automobile manufacturing, aircraft, and light industry as one of the steel industry's key products [1,2]. Surface quality is one of the most important indicators of strip steel's market competitiveness. Owing to the raw materials, the rolling method, and the external environment, the strip steel surface will inevitably be affected. During manufacturing, oxide scale, inclusions, scratches, and other imperfections emerge that are not easily visible. These flaws not only spoil the appearance but also decrease fatigue resistance, and they cannot be completely avoided simply by improving the process over time [3]. Therefore, surface fault categorization can serve as a reference throughout the production operation: suitable adjustments help increase yield and minimize manufacturing costs. Several issues arise during real-time inspection of steel surfaces, including the following:
Hazardous location: Installing inspection equipment (illumination system, camera, and signal processing hardware) in hot rolling mills is extremely dangerous. Dust, grease, grime, water droplets, and vapor are common, and the lighting system and cameras must be protected from stress and vibration. Heavy equipment is moved in and out of the site on a daily, monthly, and annual basis. These concerns necessitate appropriate physical and environmental safeguards for on-site equipment.
Operation speed: The working speed of surface inspection equipment is high, generally 20 m/s for flat steel products and 100 m/h for long products, necessitating sophisticated image processing hardware and software with short execution times.
Surface defect types: Surface flaws in steel products are quite diverse, with nine primary classes and 29 subclasses. These flaws are not governed by standards; their features and categorization differ between factories and operators, and their appearance may vary with differences in the manufacturing process.
HRC (hot-rolled coil) is the most common finished steel form in the world and an important raw material for manufacturers. It is a vital commodity that requires precise and rapid spot pricing and analysis. Many factors, from raw material costs to global trade agreements, ultimately influence the pricing of the carbon steel products customers purchase. Three main factors stand out. First, steel starts with iron ore, scrap, coking coal, and natural gas; the prices of these resources are influenced by the producing countries and traded on exchanges such as the CME (Chicago Mercantile Exchange). Second, macroeconomic factors shaping supply and demand dynamics have a significant effect: for example, when the US administration imposed a 25% tariff on steel in early 2018, the price of HRC increased in the United States. Third, depending on a product's end use, HRC is subjected to a variety of mill treatments, many of which add value but come at an additional expense.
Steel surface inspection methods currently fall into two categories: conventional techniques and deep learning techniques. In the traditional category, features are extracted and classified using the Support Vector Machine (SVM) [4], Random Forest [5], k-nearest neighbors (KNN) [6], and many other classifiers. However, because there are no obvious rules governing the distribution of flaws in steel images, extracting features is challenging, which makes the detection algorithms difficult to apply and yields poor recognition accuracy. Deep learning approaches are mainly based on convolutional neural networks (CNNs), which are used to classify defective surfaces on steel products [7]. Here, features are extracted directly from the image, resulting in high accuracy, high speed, and greater adaptability [8].
Consequently, improving surface defect classification accuracy for hot-rolled strips, so as to minimize the frequency of human intervention in defect classification, may yield considerable economic and social benefits. On the one hand, quality inspectors can avoid working late at night, which is good for their health. On the other hand, mistakes caused by inspector fatigue and other factors will be significantly reduced, boosting the performance and productivity of strip steel production and offering greater advantages to the steel factory. In brief, the paper's contributions are as follows: • A steel surface dataset of 1800 samples is adopted from the NEU Kaggle competition for steel surface detection launched three years ago (NEU-CLS). The competition corpus contains more than 87,000 digital photos of steel defects; in this work, we deliberately use only a tiny subset (300 images per class) to assess the efficacy of our proposed method.

Significance of the research:
The key objective of this study is to help small industries carry out the defect detection process with lightweight software and only a few samples of defect images, in a shorter time. This supports the development of small mills and operators with limited resources. The approach can be further improved according to clients' needs.

RELATED WORK
Experts have tended to identify defects manually, which is imprecise and error-prone [9]. Furthermore, different experts may form different judgments about identical flaws, leading to incorrect type and class assignments for strip steel defects and diminishing detection reliability. Recognition results based on researchers' subjective judgments are generally inadequate [10,11].
To overcome the limitations of manual identification, researchers have proposed a number of solutions based on machine learning technology.
Meta-learning-based methods train a meta-model to acquire knowledge across multiple tasks; examples include the Model-Agnostic Meta-Learning algorithm (MAML) proposed by Finn et al. [12] and the Long Short-Term Memory (LSTM) meta-learner developed by Ravi and Larochelle [13]. Existing meta-learning algorithms often embed an LSTM or Recurrent Neural Network (RNN) structure in the model; however, these algorithms have significant time complexity and slow running speed. As a result, they are ill-suited to industrial use.
A classification approach based on the Gray-Level Co-occurrence Matrix (GLCM) and the Discrete Shear Transform (DST) was proposed in [14]. After multi-directional shear characteristics are obtained from the images, a GLCM computation is performed. An important-feature analysis of the high-dimensional feature vectors is then carried out before they are passed to a support vector machine (SVM) to identify surface faults in strip steel. The fundamental disadvantage of the GLCM technique is its large matrix dimensionality, which demands highly capable software.
In [15], the authors presented a novel multi-hyper-sphere SVM with additional information (MHSVM+), which uses an additive learning model to reveal hidden information in defect data sets. It achieves higher classification accuracy on defect datasets, particularly corrupted ones. However, the SVM algorithm underperforms on large, noisy data sets with overlapping target classes, and when the number of features exceeds the number of training samples.
The authors of [16] designed a one-class classification technique combining generative adversarial networks (GANs) [17] and an SVM. It trains an SVM classifier on GAN-generated features and further enhances the loss function, thereby improving model stability. Regrettably, the aforementioned standard machine learning techniques often require substantial feature engineering, which greatly raises costs [18].
As indicated above, traditional machine learning-based algorithms are frequently affected by defect size and noise. Furthermore, their accuracy is insufficient to meet the practical requirements of automated defect identification, some features must be crafted by hand, and the scope of application is highly limited.
Deep learning-based techniques, notably convolutional neural networks (CNNs), have seen great success in image classification tasks in recent years [19,20]. CNNs have strong characterization capabilities [21,22] and are very successful at recognizing strip surface flaws [6,8,17].
The authors of [23] built on GoogLeNet [24], improving it slightly by including identity mapping. To minimize overfitting, the dataset was enlarged using data augmentation.
SqueezeNet [25] was applied in [26] to build an effective end-to-end model. Multiple receptive field scheduling, which can provide scale-related high-level features, was added to SqueezeNet. This benefits low-level feature training, and the model can classify strip steel surface faults quickly and consistently. One of SqueezeNet's key disadvantages, however, is its low accuracy compared with larger, more complex models. The authors of [27] proposed a modified AlexNet [28] and SVM-based intelligent surface defect inspection system for hot-rolled steel strip images. Due to receptive field limitations, CNN-based classification models have excellent fitting ability but poor global representation ability. Obtaining a significant number of fault samples in complicated industrial settings is difficult, so enlarging the dataset has become a pressing issue. The attention mechanism, by contrast, has been shown to let the model focus on more significant information, yielding higher recognition accuracy; in contemporary research, however, attention mechanisms are rarely used to characterize strip steel surface defects [29].

Our strategy consists of four main stages:
Step 1: we preprocess the data and organize it into six defect types (patches, crazing, pitted surface, scratches, rolled-in scale, inclusion). This dataset is available on the NEU steel detection competition website [30].
Step 2: we use the pre-trained CNN NasNet-Mobile as the backbone of the model, with which we extract the image features; the top layers are frozen to reuse the saved ImageNet weights. The last block is then fully removed and replaced with an entirely new one (global average pooling, dropout, dense layers with the exponential linear unit (ELU) activation, and a Softmax function for the prediction and classification layer).
Step 3: we fine-tune the model with the obtained weights and switch between optimizers (ADAM, ADAMAX) to get the best results.
Step 4: we compare the variants to pick the best fine-tuned model, taking three metrics into consideration: accuracy vs. execution time vs. model lightness.

Steel surface defect dataset
The NEU surface defect database includes six types of hot-rolled steel strip surface flaws: rolled-in scale (RS), inclusion (In), patches (Pa), crazing (Cr), pitted surface (PS), and scratches (Sc) [30]. The database contains 1800 photos (300 per surface fault type). Figure 1 depicts sample photos of the common fault types. This collection was chosen because it contains fewer photos than other databases, allowing us to compare the performance of our technique, trained on this small quantity of data, with other papers' datasets (Table 1).
A randomly selected 80% of the NEU dataset (240 photos per fault type) forms the training data; the remaining 20% is used to validate the network's classification. All data was augmented using the ImageDataGenerator from the TensorFlow [31] and Keras [32] libraries: rotation (0°, 45°, 90°, 180°), horizontal flipping, shearing (0.2), and zooming (0.2). Each image's pixel values were rescaled to the range [-1, 1] before being fed into the network.
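A sketch of this augmentation pipeline using Keras' ImageDataGenerator. The parameter values follow the text, while the batch shape and the continuous rotation range are illustrative assumptions (Keras samples angles from an interval rather than the fixed set 0°/45°/90°/180°):

```python
import numpy as np
import tensorflow as tf

# Augmentation and preprocessing as described above; validation_split
# expresses the 80/20 train/validation division.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=180,        # random rotations (the paper lists 0/45/90/180)
    horizontal_flip=True,
    shear_range=0.2,
    zoom_range=0.2,
    validation_split=0.2,      # 80% training / 20% validation
    preprocessing_function=lambda x: x / 127.5 - 1.0,  # rescale to [-1, 1]
)

# A dummy batch stands in for the NEU images (3 channels assumed for the
# ImageNet backbone; real NEU images are grayscale).
images = np.random.randint(0, 256, size=(8, 200, 200, 3)).astype("float32")
batch = next(datagen.flow(images, batch_size=8, shuffle=False))
print(batch.shape)
```

The `preprocessing_function` runs after the geometric transforms, so every pixel reaching the network lies in [-1, 1].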

Classification model-Improved NASNet-Mobile
Neural Architecture Search (NAS) is the technique of automating the construction of a neural network topology to obtain the best results on a given task, with few resources and as little human intervention as possible. The authors of [36] created the NasNet architecture, a neural-architecture-search network that learns the best parameters of a generated architecture using a recurrent neural network (RNN) and reinforcement learning. Designing a CNN architecture takes a long time when the dataset is large, for instance ImageNet. They therefore developed a CNN framework capable of searching for the best architecture on a small dataset and then transferring that architecture to be trained on huge datasets; this is known as "learning transferable architectures". The NASNet-Mobile architecture may be scaled according to data volume.

Depthwise and pointwise convolutions
The NasNet-Mobile framework is based on depthwise separable convolutions [37], a form of factorized convolution in which a conventional convolution is divided into a depthwise convolution and a 1 x 1 convolution known as a pointwise convolution. NasNet-Mobile uses the depthwise convolution to apply an individual filter to each input channel; the pointwise convolution then combines the depthwise outputs through a 1 x 1 convolution. A conventional convolution filters and mixes inputs in a single step to generate a new set of outputs; the depthwise separable convolution divides this into two layers, one for filtering and one for combining. This factorization significantly reduces computation and model size. The depthwise layer output is linearly mixed by the pointwise convolution, a basic 1 x 1 convolution (Eq. (1)). We study modifications of the architectural configuration of each reference structure empirically (Section II). We apply transfer learning from network models trained on ImageNet [38]: in the simplified CNN framework, the final block is removed and replaced with a new block containing global average pooling, dropout, dense layers, and a Softmax function in the last prediction layer to forecast the steel defect class (Figure 3). For the first part of training (before fine-tuning), the whole architecture is frozen except for the newly created block. After that, we unfreeze the top of the model so it can train again toward the target task (steel fault classification). This prevents the network from over-fitting during training and allows the model to learn faster and for longer, resulting in improved generalization. Using the lightweight NasNet architecture provides various advantages, including faster model training, less proneness to over-fitting on small datasets, and deployability on embedded systems.
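The parameter savings from this factorization can be checked with back-of-envelope arithmetic; the shapes below are illustrative, not NasNet-Mobile's actual ones:

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Depthwise k x k (one filter per input channel) plus pointwise 1 x 1."""
    depthwise = k * k * c_in          # one k x k filter per input channel
    pointwise = 1 * 1 * c_in * c_out  # 1 x 1 convolution mixing the channels
    return depthwise + pointwise

k, c_in, c_out = 3, 64, 128
standard = conv_params(k, c_in, c_out)        # 73,728 weights
separable = separable_params(k, c_in, c_out)  # 576 + 8,192 = 8,768 weights
print(standard, separable, round(standard / separable, 1))
```

For a 3 x 3 kernel the factorized form needs roughly an order of magnitude fewer weights, which is where NasNet-Mobile's small footprint comes from.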

NASNet-Mobile-based defect classification
a. The reason for choosing NASNet-Mobile: There are three main reasons for taking this CNN as the backbone of our model. First, its lightness: it occupies only 23 MB of memory, which is much smaller than other models (VGG16 takes 528 MB, ResNet152 takes 232 MB, NASNet-Large takes 343 MB, etc.). Second, the number of parameters: this model has only 5.3 million parameters, which is comparatively very small (VGG16, for example, has 143.7 million parameters, making it 27 times larger than our NASNet-Mobile backbone). Third, this model takes only 27 ms per inference step on a CPU and 6.7 ms per inference step on a GPU, roughly 60 times less than EfficientNetB7 (1578.9 ms per inference step).
b. Modified NASNet-Mobile: NasNet-Mobile's base model is pre-trained with 1,056 output channels for ImageNet [38] recognition. The core experimentation with this architecture concerns the number of normal cells in the model. We employed three reduction cells and three normal cells in our modified NASNet-Mobile design (Figure 4). The total number of parameters is 4,376,022, of which only 106,306 (2.43%) are trainable; the rest are frozen.

Figure 4. General structure of the proposed approach
We use the pre-trained NASNet-Mobile framework as the backbone, which consists of six cells (reduction and normal), followed by a newly constructed defect classification block that includes a convolution layer, dropout, dense layers, and global average pooling. The activation function of the first dense layer is ELU rather than ReLU. ELU, the Exponential Linear Unit, converges the cost toward zero faster and produces more accurate results [39]. In contrast to other activation functions, ELU has an extra constant, alpha, which must be positive, as seen in Eq. (2).
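A minimal sketch of this assembly. The 128-unit first dense layer is an assumption (the paper does not state the width), and a tiny stand-in backbone replaces the pre-trained NASNet-Mobile so the snippet runs without downloading ImageNet weights; in practice the backbone would be `tf.keras.applications.NASNetMobile(include_top=False, weights="imagenet")` with its layers frozen:

```python
import tensorflow as tf

# Stand-in for the frozen, pre-trained NASNet-Mobile feature extractor.
backbone = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(),
])
backbone.trainable = False  # freeze the feature extractor (transfer learning)

# New defect classification block: GAP -> dropout -> ELU dense -> softmax.
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(128, activation="elu"),    # ELU first dense layer
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(6, activation="softmax"),  # six defect classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)
```

With the backbone frozen, only the new head's weights appear in `model.trainable_weights`, mirroring the small trainable fraction reported above.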
ELU is very similar to ReLU except for negative inputs: both are the identity function for non-negative inputs. ELU, however, smooths gradually until its output saturates at −α, whereas ReLU cuts off sharply at zero (Figure 5). The reason for using ELU instead of ReLU as the activation function is precisely this gradual saturation toward −α; furthermore, unlike ReLU, ELU can produce negative outputs.
c. Advantages of the exponential linear unit (ELU): The ELU is a continuous and differentiable activation function that offers faster training than other non-saturating linear functions such as ReLU and its variants (Leaky-ReLU (LReLU) and Parameterized-ReLU (PReLU)). It does not suffer from dying neurons or from exploding or vanishing gradients. Compared with activation functions such as ReLU, Sigmoid, and Hyperbolic Tangent, it achieves higher accuracy.
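The function itself is two lines of NumPy (alpha = 1 assumed, matching the common default in [39]): identity for non-negative inputs, α(eˣ − 1) for negative ones, saturating smoothly toward −α:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for x >= 0; alpha * (exp(x) - 1) for x < 0.
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -1.0, 0.0, 2.0])
print(elu(x))  # large negative inputs approach -alpha, not 0 as in ReLU
```

Unlike ReLU, the gradient for negative inputs is small but nonzero (α·eˣ), which is what avoids dead neurons.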
The steel surface defect classifier's variables are updated by minimizing a multi-class loss function known as categorical cross-entropy (Eq. (3)). Several strategies are applied to improve accuracy and reduce execution time. First, data augmentation is used to give the model more features to learn. Second, the new block defined above is added at the bottom of the model for the prediction part; this block helps improve accuracy while reducing model parameters and execution time. Third, we switch between optimizers (ADAM and ADAMAX) to find the best one. Finally, the learning rate is reduced using exponential decay as in Eq. (4), and early stopping is applied when the model's accuracy can no longer improve; the model then restores the best weights.
where y represents the final value, a the initial value, b the decay factor, and x the elapsed time, i.e., y = a · b^x.
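A sketch of this schedule with illustrative values (the paper's actual initial rate and decay factor are in Table 2 and not restated here). In Keras the same behavior is available via `tf.keras.optimizers.schedules.ExponentialDecay` together with `tf.keras.callbacks.EarlyStopping(restore_best_weights=True)`:

```python
def decayed_lr(initial_lr, decay_factor, elapsed):
    # Exponential decay: y = a * b ** x
    return initial_lr * decay_factor ** elapsed

lr0, b = 1e-3, 0.9  # illustrative initial rate and decay factor
schedule = [decayed_lr(lr0, b, epoch) for epoch in range(5)]
print([round(v, 6) for v in schedule])
```

Each epoch multiplies the rate by the decay factor, giving large early steps and progressively finer updates as the weights approach their optimum.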

Model implementation
Our method is deployed on the publicly available Python platform Google Colaboratory [40]. TensorFlow [31], Keras, Matplotlib, NumPy, and Glob are the main libraries used in this implementation. We took 80% of the photos in the NEU collection as training data (240 images per fault category) and 20% as validation data. Performance is evaluated both before and after fine-tuning the network. Table 2 shows the hyperparameter values used to train this CNN.
The experiments were performed under Windows 10 Professional on a 64-bit Intel® Core™ i5-7200U platform with 8 GB of RAM and an NVIDIA RTX 2070, taking advantage of the free GPU available on the Google Colab platform. Training with the surface defect dataset was fast: it took only 3414 seconds (56 minutes and 54 seconds) to train the model before fine-tuning and 528 seconds (8 minutes and 48 seconds) after fine-tuning.

Model evaluation
The model was run twice, once without fine-tuning the parameters and once with fine-tuning. The following deep learning metrics are used to assess the model: accuracy, loss, recall, AUC, FP, FN, TP, TN, and precision. With separate datasets for training and validation, we compare these measures before and after fine-tuning.
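These metrics derive from the confusion counts (TP, TN, FP, FN); a minimal helper, with made-up counts rather than the paper's:

```python
def precision(tp, fp):
    # Fraction of predicted positives that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual positives that are recovered (sensitivity).
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    # Fraction of all predictions that are correct.
    return (tp + tn) / (tp + tn + fp + fn)

tp, tn, fp, fn = 90, 95, 5, 10  # illustrative counts only
print(precision(tp, fp), recall(tp, fn), accuracy(tp, tn, fp, fn))
```

In Keras the same quantities are tracked during training by passing `tf.keras.metrics.Precision`, `Recall`, and `AUC` to `model.compile(metrics=...)`.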
a. Performance of the model before fine-tuning: The results are shown in the following tables and graphs, along with an analysis of each.
The metrics in Table 3 show very promising results on both the training and validation datasets. A slight gap between them can be noted: the model learns from the training data, which makes it more reliable there, whereas the validation data holds only 20% of the total and has never been seen by the model during training. Since evaluating the model on the training dataset can produce biased results, it is tested on a held-out sample to offer an impartial evaluation of its competence. Strategies to mitigate this performance gap include model fine-tuning and dataset augmentation, so the model can learn additional features.

Figure 6 displays the training and validation optimization curves of NasNet-Mobile on the aforementioned dataset of 1800 photos augmented through ImageDataGenerator (these results were obtained prior to fine-tuning). Training spanned 20 epochs, with a break at the 12th. We can observe a declining trend as the number of epochs grows, followed by both the validation loss and the training loss (Figure 7). During the learning phase, the model appears to identify the visual prominence of the reference picture and the candidate image, so the training loss tends to decrease. In the testing phase, images were chosen at random from a set never shown to the network during training. Consequently, as the training steps progress, the accuracy on the held-out set follows that of the training set while remaining only slightly inferior. This shows that the algorithm, trained on the cases of the training set, correctly predicts cases that were not in it. As the curves show (Figure 8 and Table 3), the best-achieved precision is about 99.51% on the training data and 98.6% on the validation data; recall is about 99.51% for training and 97.78% for validation. These results were obtained before fine-tuning.
b. Performance of the model after fine-tuning: The training process was stopped early, as shown in Figure 9, because there was no improvement over three consecutive epochs. The instability of both accuracy and loss on the training and validation datasets is clearly visible in Table 4, even though the model's accuracy can reach 100%, which is very good for such a small amount of data (only 300 images per class: 240 for training and 60 for validation). This instability is caused by the absence of batch normalization in the experiments (Figures 9 and 10). Batch normalization enhances training by reducing internal covariate shift, improving stability, and helping optimize the model. It also improves generalization by normalizing layer activations, reducing overfitting, and reducing sensitivity to the initial weights. In addition, batch normalization allows higher learning rates, accelerating the training phase and minimizing the need for precise initialization (no batch normalization layer was included in our architecture in the first experiment). To overcome this issue, further fine-tuning, such as adding batch normalization layers and retraining the model, should be pursued.
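The normalization this fix relies on can be sketched as a NumPy forward pass (training-mode batch statistics only; the learnable scale gamma and shift beta are left at their defaults):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature (column) to zero mean / unit variance over the
    # batch, then rescale by gamma and shift by beta.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

acts = np.array([[1.0, 50.0], [3.0, 70.0], [5.0, 90.0]])  # toy activations
out = batch_norm(acts)
print(out.mean(axis=0), out.var(axis=0))  # approximately 0 mean, 1 variance
```

In the architecture itself this would correspond to inserting `tf.keras.layers.BatchNormalization()` between the dense layers of the classification head.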

Comparative study
In this part, we compare our model to prominent steel surface inspection methodologies already in use, taking into consideration not only accuracy but also execution time, model lightness (number of parameters), and data size. According to Table 5, our model clearly beats the aforementioned models with regard to accuracy, achieving 99.51%, which is higher than the other reported accuracies. The proposed NasNet-Mobile reaches very satisfying results with fewer model parameters and a lower running time, and it does not require large data: 48 times less than the model of [34], 7 times less than the model of [33], and 14 times less than the model of [41]. Our model is also the lightest (only 5.3 million parameters, very low compared to the DeCAF model [41] with 60 million parameters and the residual neural network [22] with 25.6 million parameters).
Finally, the running time of our network is shorter than the others': three times less than the model of [33] and 2.5 times less than the model of [22]. The error rate of our model is also the lowest. In Table 6, our model's error rate is 0.028, which is 22 times less than the CNN method with the LU activation function. This means our model learns better, with fewer errors and better accuracy.

CONCLUSIONS
In this paper, we have proposed a novel method based on the pre-trained NASNet-Mobile CNN to classify defects in steel sheets. The main findings of the research are as follows. The issue of accurate classification under memory constraints is addressed using the NASNet-Mobile network, which has a small number of parameters compared to other CNNs (5.3 million). The top layers of this CNN were frozen, which reduces memory use and computation without discarding the pre-trained weights. The long-execution-time dilemma is alleviated using the free GPU available on the Google Colab platform. The problem of dataset scarcity is addressed and mitigated by the capability of this CNN, whose considerable depth (389 layers) can extract the necessary image features even though the dataset is tiny.
The modification of the hyper-parameters improved the fine-tuned model. We found that the ADAMAX optimizer outperforms the ADAM optimizer in this modified NasNet-Mobile architecture; thus we noted a slight improvement in both accuracy and error rate. The Adam optimizer adjusts weights in inverse proportion to the scaled L2 norm of past gradients, whereas AdaMax extends this to the infinity norm of past gradients.
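The contrast between the two second-moment trackers can be sketched for a single parameter (beta2 and the gradient sequence are illustrative values, not the paper's settings):

```python
def adam_v(grads, beta2=0.999):
    # Adam: exponential moving average of squared gradients (scaled L2 norm).
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2
    return v

def adamax_u(grads, beta2=0.999):
    # AdaMax: exponentially weighted infinity norm of past gradients.
    u = 0.0
    for g in grads:
        u = max(beta2 * u, abs(g))
    return u

grads = [0.1, -0.5, 0.2]
print(adam_v(grads), adamax_u(grads))
```

The infinity-norm update makes AdaMax's per-parameter scale track the largest recent gradient magnitude rather than an average, which can behave more stably when gradients are sparse or spiky.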
The model has been adapted: the last block was entirely removed and replaced with a new one, and further modifications were adopted to match the requirements of steel image features. An ELU activation function was set in the dense layers; we deliberately chose this activation function to exploit its advantages. It helps solve the dying ReLU problem, in which the gradient is 0 on the negative side of the graph, so that the weights and biases of certain neurons are never updated during backpropagation, producing dead neurons that are never activated. By introducing a smooth exponential curve for negative input values, ELU addresses the dying ReLU problem and helps the network adjust weights and biases in the proper direction. The ELU activation function can increase model performance and network robustness; it saturates gradually toward −α for negative inputs, whereas ReLU cuts off abruptly at zero.
A dropout (0.2) was added after each fully connected layer, and global average pooling was placed before the dense layers, which helped minimize time and memory during model implementation. Learning-rate scheduling lets us take larger steps for the first few epochs, then gradually smaller steps as the weights approach their optimal values.
Further improvements can be achieved by gathering more training data and/or strengthening the network's architecture and fine-tuning its hyper-parameters, rather than raising the number of training epochs under the present structure, which might result in over-training.
In general, we conclude that the suggested algorithm can be adopted in image processing tasks such as classification to overcome three main challenges faced by small mills and non-powerful operators: time consumption, memory insufficiency, and limited data.

Figure 3. Convolution cell block acquired via RNN exploration. K is the depthwise convolutional kernel of size Dk × Dk × M, where the m-th filter in K̂ is applied to the m-th channel of F to output the m-th channel of the filtered feature map Ĝ. As shown in Figures 2(a) and 2(b), the RNN merges two hidden layers to move on to the following hidden layer. In the loss notation, y_i represents the i-th scalar value in the model output, ŷ_i indicates the corresponding target value, and the output size is the number of scalar values in the model output.

Figure 5. Graph showing the difference between ELU (green) and ReLU (red) activation functions [39]

d. Model optimization for steel surface defects: Following the development of the basic NasNet-Mobile model for steel surface defect inspection, we propose several viable strategies for improving accuracy and reducing execution time.

Figure 6. Accuracy and loss curves before fine-tuning the NasNet-Mobile

Figure 7. Precision and accuracy curves before fine-tuning the NasNet-Mobile

Precision is the proportion of correctly classified instances, precision = TP / (TP + FP) (Eq. (5)), while recall (also known as sensitivity) is the proportion of relevant instances recovered, recall = TP / (TP + FN) (Eq. (6)). Relevance thus determines precision and recall.

Table 2. Hyperparameters used to train the NasNet-Mobile convolutional neural network

Table 3. Performance metrics of NasNet-Mobile on the training and validation datasets before fine-tuning

Table 4. Performance metrics of NasNet-Mobile on the training and validation datasets after fine-tuning

Figure 8. Training and validation loss

Table 5. Classification accuracy (%) for several state-of-the-art steel surface fault classifiers, taking into account the triplet model lightness vs. running time vs. data size

Table 6. Evaluation of classification error for various activation functions; each model is named with its activation function in parentheses