Predictive Analysis of Soil Organic Matter and Moisture Content Using Image-Based Modeling

ABSTRACT


INTRODUCTION
For human beings, images are one of the richest sources of information because they support a deep understanding of almost any scene. Consequently, scientists and researchers have attempted to exploit this visual information in a variety of applications. Image processing involves analysing and manipulating images using computers, so images become an enormous source of information provided that their features are properly examined. Different processes serve different purposes, namely image restoration, image compression, image enhancement, image registration, etc. In recent years, the agriculture field has found many applications for image processing and machine learning. Some of these applications are fruit grading [1], precision farming [2], weed detection [3], crop prediction [4], soil texture classification and prediction [5], and soil pH level prediction [6].
One such application is the prediction of soil moisture content (SMC) and soil organic matter (SOM) [7]. SOM is an essential component of high-quality, healthy soil. It comprises organic remains of animals and plants at various stages of decomposition, along with material converted by the microorganisms present in the soil. SOM therefore affects agricultural and forestry production. Good-quality soils with steady levels of organic matter are capable of preventing and fighting soil-borne diseases. Moreover, SOM has a key role in boosting soil quality and fertility. SOM also indicates the extent of nitrogen that the soil can supply for crop growth. Although organic matter varies greatly across the surface within a field [7], spatial SOM statistics help in managing agricultural resources site by site, which involves applying nitrogen fertilizer and achieving a trade-off between reducing environmental pollution and increasing crop production [8]; this is one of the important components of precision agriculture [9]. SMC is the quantity of water present in soil. It is one of the most important soil properties, and knowing it has many advantages, since it indicates the need for irrigation, the availability of nutrients and chemicals, erosion, biological activity, and compaction potential. Thus, soil properties such as SOM and SMC are associated with the health of the soil, and knowing them helps farmers and land managers make decisions that enhance soil conditions. Some of the challenges in estimating and assessing SOM and SMC are labor- and time-intensive laboratory evaluation [10], monitoring variability in SMC and SOM [11], spatial variation [12], and the influence of components such as soil type [13,14].
These challenges have motivated researchers to develop cost-effective and rapid ways to estimate soil properties. Soil colour is one of the most informative characteristics for determining soil properties, and it has also been used in soil identification [15]. Soil colour has been observed to correlate strongly with spectral reflectance features, and strong correlations have been reported between SOM and soil colour [16][17][18][19][20][21] and between SMC and soil colour [22][23][24][25]. A dark soil colour is essentially linked to higher values of SOM, SMC, and intrinsic soil fertility [26][27][28]. Therefore, existing research on image processing-based prediction of soil organic carbon (SOC) and SOM [29][30][31][32][33] uses soil colour as the main link. Dark soil colour is usually associated with a high organic matter content; this kind of soil is fertile and capable of supporting plant growth [19]. The Munsell colorimetric system [34] is the conventional method for quantifying soil colour, and it involves subjective visual matching between standard colour chips and soil samples. Therefore, when accurate colour dynamics and automatic colour matching are required, the Munsell colorimetric system is not a suitable method. Modern image acquisition devices overcome the drawbacks of the Munsell colorimetric system and enhance SOM prediction. In one study [30], a high-resolution NIR digital camera was used under controlled conditions for SOM prediction, and SOM was reported to be associated with the intensity of all the wavebands. The CIEL*c*h* and CIEL*u*v* models were observed to be a good fit for SOC prediction [31]. A later study [32] compared SOC predictions across different colour spaces. Various other studies also reported a strong relationship between the colour of the soil sample and SOM [30,31,35,36,37,38]. Later, this relationship was tested with the cell-phone application SOCIT [39].
In many studies [29,32,39,40], soil colour is the prime parameter used to develop a SOM prediction model, and other factors such as surface residue, soil moisture, or surface roughness [41] are not considered. Of these factors, SMC is an important one that controls the practical assessment of SOM. Generally, wet soils are darker in colour than dry soils [42,43]. As SMC increases, water gradually fills the macro- and micropores of the soil, which changes its physical composition. As a consequence, the relative refractivity of the soil particles changes [14], which causes a change in soil colour. This in turn complicates the relationship between soil colour and SOM, so SMC becomes one of the deciding factors for predicting SOM from images. In the above-mentioned studies, SMC was an unavoidable factor in the predictive models because the soil samples varied in moisture content. Furthermore, some research introduced the concept of a critical moisture content, that is, an SMC threshold value: soils with SMC below and above this threshold differ in colour [14,29]. Many factors contribute to the value of the SMC threshold. Some researchers observed that soil reflectance in the visible region changes only until SMC reaches 20% [44][45][46][47], and one study suggested a critical SMC of 15% [14]. There is thus some evidence that a critical value of SMC influences soil reflectance and can affect SOM prediction from digital images.
The objective of this study was to predict SMC and SOM using colour and textural features extracted from soil sample images (Figure 1). This paper evaluates the suitability of soil sample images acquired with a smartphone camera for this purpose. The models are calibrated and validated to establish predictive relationships between features extracted from the images and laboratory-measured SMC and SOM. Additionally, the few best-performing features are selected from among all the features. Figure 2 shows the overall workflow in the form of a block diagram.

EXPERIMENT AND IMAGE ACQUISITION SETUP
A set of 250 soil samples was considered in this study. The samples come from fields of five different crops: potato, sugarcane, mustard, wheat, and rice. They were collected in the district of Mau, Uttar Pradesh, India; the collection site is located at 25°56'30"N latitude and 83°33'40"E longitude. For every crop, 10 fields of approximately the same size were considered, and five soil samples were dug out of each field, giving 250 soil samples across all five crops. Multiple samples were taken from multiple fields because the fields showed high spatial (within-field) variation [48] in SOM and SMC. A large number of soil samples represents the variation in soil conditions more precisely; the 250 samples collected here represent a broad variation in organic matter content (2.5-74.6%). The samples were dug out from a depth of 2 inches below the surface of the field and spread uniformly on a white sheet for image acquisition. Images (3024×4032 pixels) were captured with a 16-megapixel smartphone camera held 20 inches above the surface of the soil sample. The camera settings were kept at their defaults (exposure time = 1/30 s, F-stop = f/1.8, focal length = 1.12), and images were saved in Joint Photographic Experts Group (JPEG) format. While creating the soil image dataset, 8 images of each sample were captured, so the dataset contains a total of 2000 soil images. Figure 1 shows some examples of the soil sample images from the dataset.
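As a quick consistency check, the sample and image counts follow directly from the sampling design described above (crops, fields per crop, samples per field, and images per sample):

```latex
N_{\text{samples}} = 5\ \text{crops} \times 10\ \text{fields} \times 5\ \text{samples per field} = 250,
\qquad
N_{\text{images}} = 250\ \text{samples} \times 8\ \text{images per sample} = 2000
```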
In this study, SMC was consistently maintained during laboratory experiments, similar to other studies [49], despite its quasi-normal distribution in field observations [8]. The SMC and SOM were measured using the loss on ignition (LOI) method [48]. Maintaining constant SMC in samples can introduce bias in feature extraction through image processing, as different soil samples vary not only in SOM but also in water holding capacity. For instance, a sample with 2.59% SOM registered an SMC of 25.51%, whereas a sample with 6.91% SOM showed an SMC of 53.47%. Figure 2 outlines the methodological steps followed: initially, soil samples are collected and the ground truth is established; subsequently, illumination normalization is applied to the images; features are then extracted for various colour models; finally, a regression model is assessed using diverse output parameters.

Segmentation of region of interest
Before the images are passed to feature extraction, they are first segmented to obtain the region of interest. For this purpose, two different methods are applied: the threshold method and Otsu's method. First, the threshold method is used to segment the image. Figure 3 shows the outcome of applying different threshold values. It is clear from this figure that the method operates best at a threshold value of 0.50, which gives the best separation in terms of the fraction of pixels assigned to the region of interest. The histogram in Figure 4 shows why the optimal threshold value lies in the range of 0.40 to 0.60.
After this, Otsu's method is used to segment the region of interest from the image. In principle, it is a better solution than searching manually for the best threshold value with which to binarize the image. However, the method relies on the image consisting of a background and a foreground, meaning that the histogram should clearly show only two separable distributions. Figure 5 shows the outcome of Otsu's method. Although this method does segment the image, certain areas of the region of interest are classified as background. As can be observed from the segmentation results, the threshold method generated better segmentation results than Otsu's method. Therefore, further image processing analysis is performed on the output images of the threshold method.
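The two segmentation approaches compared above can be sketched in a few lines. This is a minimal illustration and not the implementation used in the study: the function names are hypothetical, the image is assumed to be a grayscale grid with intensities scaled to [0, 1], and the soil is assumed to be darker than the white sheet, so pixels below the threshold are treated as the region of interest.

```python
def soil_mask(gray, t=0.50):
    """Binary mask of the region of interest for a fixed threshold t.

    Assumption: soil is darker than the white background sheet, so a
    pixel belongs to the soil when its intensity is below t.
    """
    return [[px < t for px in row] for row in gray]


def otsu_threshold(gray, bins=256):
    """Choose the threshold that maximizes between-class variance (Otsu)."""
    pixels = [px for row in gray for px in row]
    hist = [0] * bins
    for px in pixels:
        hist[min(int(px * bins), bins - 1)] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w_low = 0        # number of pixels in the lower (darker) class
    sum_low = 0.0    # sum of bin indices in the lower class
    best_var, best_bin = -1.0, 0
    for i, h in enumerate(hist):
        w_low += h
        if w_low == 0:
            continue
        w_high = total - w_low
        if w_high == 0:
            break
        sum_low += i * h
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between = w_low * w_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_var, best_bin = between, i
    # Bins up to best_bin form the darker class; return the upper bin edge.
    return (best_bin + 1) / bins
```

On a clearly bimodal image the two masks coincide; Otsu's method degrades when the histogram does not show two separable distributions, which is the limitation noted above.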

Illumination normalization
In the soil image dataset, illumination can vary between images because of changes in lighting, shadows, and noise, and because the images were captured on different days. Illumination normalization was performed by dividing every pixel of the segmented region of interest (ROI) in the soil images by the mean intensity of its corresponding reference, as shown in Eq. (1):
$$I_{n}(a,b)_{\lambda} = \frac{I_{o}(a,b)_{\lambda}}{\bar{I}_{r,\lambda}} \qquad (1)$$
where $I_{o}(a,b)_{\lambda}$ is the original pixel value at the ath row and bth column of the soil ROI; $I_{n}(a,b)_{\lambda}$ is the illumination-normalized pixel value at the ath row and bth column of the soil ROI; and $\bar{I}_{r,\lambda}$ is the mean pixel value of the corresponding reference for the specific waveband λ. As a suggestion for future work, a band or a colour palette can be used as a reference for illumination normalization.
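The normalization in Eq. (1) amounts to a per-waveband division by a scalar. The sketch below illustrates this under stated assumptions: the function name and the dictionary layout (band name mapped to a 2D grid of pixel values) are hypothetical, and the reference is assumed to be a patch of pixels from which the mean is taken.

```python
def normalize_illumination(roi, reference):
    """Divide every ROI pixel by the mean reference intensity, per waveband.

    roi, reference: dicts mapping a band name (e.g. "R") to a 2D list of
    pixel values. This layout is an assumption made for illustration.
    """
    normalized = {}
    for band, pixels in roi.items():
        ref_values = [v for row in reference[band] for v in row]
        ref_mean = sum(ref_values) / len(ref_values)
        normalized[band] = [[px / ref_mean for px in row] for row in pixels]
    return normalized
```

For example, an ROI pixel of 100 in a band whose reference patch averages 200 is mapped to 0.5, so images taken under brighter or dimmer lighting are brought onto a comparable scale.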

Feature extraction
After segmentation and illumination normalization, feature extraction is carried out. The sample images were acquired and stored in the RGB colour space, with R, G, and B as the primary colour parameters. The images were then transformed into other colour models, namely HSV, YIQ, YCbCr, and CIEL*a*b*, whose secondary colour metrics carry different levels of information. In this work, 34 different features were extracted, including colour features. Along with the colour moments, a gray-level co-occurrence matrix (GLCM) was also calculated. The 34 features include median R, G, B, H, S, and V; mean R, G, B, H, S, and V; mean gray; median gray; homogeneity; contrast; and energy. The colour features are the mean and median of the channels of the different colour spaces. Table 1 shows the definition of these features.
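To make the feature definitions concrete, the sketch below computes a subset of the features: the mean and median of the RGB and HSV channels, and three GLCM texture measures. It is an illustration under assumptions, not the pipeline used in the study: the function names are hypothetical, pixel values are assumed to be scaled to [0, 1], the gray image is assumed to be already quantized to integer levels, and only a single horizontal neighbour offset is used for the GLCM.

```python
import colorsys
from statistics import mean, median


def colour_stats(rgb_pixels):
    """Mean and median of R, G, B, H, S, V for (r, g, b) tuples in [0, 1]."""
    hsv_pixels = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in rgb_pixels]
    features = {}
    for names, pixels in (("RGB", rgb_pixels), ("HSV", hsv_pixels)):
        for idx, name in enumerate(names):
            channel = [p[idx] for p in pixels]
            features["mean_" + name] = mean(channel)
            features["median_" + name] = median(channel)
    return features


def glcm_features(gray, levels):
    """Contrast, energy and homogeneity from a horizontal-neighbour GLCM.

    gray: 2D list of integer gray levels in range(levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    for row in gray:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
    total = sum(map(sum, counts))

    contrast = energy = homogeneity = 0.0
    for i in range(levels):
        for j in range(levels):
            p = counts[i][j] / total          # normalized co-occurrence
            contrast += (i - j) ** 2 * p      # intensity contrast of neighbours
            energy += p * p                   # sum of squared GLCM elements
            homogeneity += p / (1 + abs(i - j))  # closeness to the diagonal
    return {"contrast": contrast, "energy": energy, "homogeneity": homogeneity}
```

The remaining colour spaces (YIQ, YCbCr, CIEL*a*b*) would follow the same pattern once the corresponding channel conversion is available.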

MODEL DEVELOPMENT
A total of 2000 soil images were captured while acquiring the dataset from the soil samples. The dataset is divided into calibration and validation sets with a ratio of 80:20 using the Kennard-Stone algorithm [50][51][52]. Features extracted from the images were used to build a predictive relationship with the laboratory-measured SMC and SOM values. The results are verified using internal and external validation. For internal validation, 10-fold cross-validation is used, where the data is divided 80:20 for training and testing. For external validation, leave-one-out cross-validation (LOOCV) is used, where the model is trained on all the data except one image, which is used for testing; this is repeated until every image has been used for testing. A stepwise multiple linear regression (SMLR) predictive model is developed to exploit the relationship between the colour parameters and SOM (and also SMC). The SMLR model is also proposed (Figure 2) between colour features and SMC, which includes SOM. At every step of forward SMLR, the most statistically significant feature, that is, the one with the lowest p-value, is added to the model, and the resulting change in p-value and F-statistic is logged. The p-value threshold for including or excluding a colour feature was set at 0.07. Multicollinearity was taken into consideration because strong correlations are found between colour features, as can be observed in Table 2. The root mean square error (RMSE) and the coefficient of determination (R²) are used to evaluate the model. The subscripts c, v, and cv denote calibration, validation, and LOOCV, respectively.
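The Kennard-Stone division mentioned above selects calibration samples so that they cover the feature space uniformly. A minimal sketch of the standard procedure is given below; the function name is hypothetical, Euclidean distance on the raw feature vectors is assumed, and at least two calibration samples are assumed to be requested.

```python
import math


def kennard_stone(X, n_cal):
    """Return the indices of n_cal calibration samples (Kennard-Stone).

    X: list of feature vectors. Assumes n_cal >= 2 and Euclidean distance.
    """
    n = len(X)
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]

    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in selected]

    while len(selected) < n_cal and remaining:
        # Add the sample whose nearest selected neighbour is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected
```

For an 80:20 division, `n_cal` would be 80% of the number of samples, with the unselected indices forming the validation set.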
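RMSE is calculated as in Eq. (2):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_{e,i}-Y_{m,i}\right)^{2}} \qquad (2)$$
where Ye is the predicted SOM, Ym is the measured SOM, and n is the number of samples. RMSEc and RMSEv were also calculated using Eq. (2).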
RPD is the ratio of the standard deviation (SD) of the measured values to the standard error of prediction (SEP) [51], as in Eq. (3):
$$\mathrm{RPD} = \frac{SD}{SEP} \qquad (3)$$
RPDc and RPDv were also calculated using Eq. (3). In this study, RPD classification followed Chang et al. [51]; RPD values greater than 2 indicate that a model is qualitatively good. The ratio of performance to interquartile distance (RPIQ) is also calculated because RPD is closely linked to R² [52]; RPIQ gives a better representation of the spread of values in the dataset. Initially, all 34 features were used as predictor variables to predict SOM (%) and SMC (%). Internally, 10-fold cross-validation was also performed. The difference between the ground truth and the predicted value was also taken into account; this difference function was found advantageous for pedo-transfer function development. Low values of bias and RMSE and large values of R² and Lin's concordance correlation coefficient (LCCC) represent higher prediction accuracy.
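The assessment metrics named in this section can be computed from paired measured and predicted values as sketched below. This is a hedged illustration rather than the study's code: the function names are hypothetical, RMSE is used as the error term in RPD and RPIQ, R² is taken as one minus the ratio of residual to total sum of squares, and the quartiles use the default convention of Python's `statistics.quantiles`, which may differ from the convention used by the authors.

```python
import math
from statistics import mean, quantiles, stdev


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    return math.sqrt(mean([(p - m) ** 2 for m, p in zip(measured, predicted)]))


def bias(measured, predicted):
    """Mean of predicted minus measured."""
    return mean([p - m for m, p in zip(measured, predicted)])


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    m_bar = mean(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - m_bar) ** 2 for m in measured)
    return 1 - ss_res / ss_tot


def rpd(measured, predicted):
    """Standard deviation of measured values divided by the prediction error."""
    return stdev(measured) / rmse(measured, predicted)


def rpiq(measured, predicted):
    """Interquartile range of measured values divided by the prediction error."""
    q1, _, q3 = quantiles(measured, n=4)
    return (q3 - q1) / rmse(measured, predicted)


def lccc(measured, predicted):
    """Lin's concordance correlation coefficient."""
    m_bar, p_bar = mean(measured), mean(predicted)
    cov = mean([(m - m_bar) * (p - p_bar) for m, p in zip(measured, predicted)])
    var_m = mean([(m - m_bar) ** 2 for m in measured])
    var_p = mean([(p - p_bar) ** 2 for p in predicted])
    return 2 * cov / (var_m + var_p + (m_bar - p_bar) ** 2)
```

Note that a constant offset between prediction and measurement leaves the correlation at 1 but lowers LCCC, which is why LCCC is reported alongside R².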
A z-score was defined to summarize the performance of each predictor across four different analyses, namely ANOVA (analysis of variance), Cubist, correlation, and Vtreat. In each analysis, every predictor was given a rating in the range of 0 to 100 (with 0 as the least important and 100 as the most important), and these ratings were then averaged to obtain the z-score. Cubist analysis gives variable importance in a range of 0 to 100; results from analyses that are not on this scale are converted to the same range of 0 to 100. The correlation analysis considered the one-to-one correlation between the dependent variable and each predictor, and the correlation coefficients were scaled between 0 (the least important) and 100 (the most important). In ANOVA, the p-value for every predictor is scaled between 0 and 100, with 0 being the lowest and 100 being the highest p-value. R² values are likewise recorded between 0 and 100, with 0 as the lowest value and 100 as the highest value. These values are then added and averaged to generate scaled values in the range of 0 to 100; these values are the z-scores of the predictors. After this, the six predictor variables with the highest z-scores were identified as the optimum predictors for both SOM and SMC. The model was then trained again with these six predictors as independent variables, and the model assessment metrics were calculated.
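The aggregation described above, rescaling each analysis to 0-100 and averaging across analyses, can be sketched as follows. The function names and the min-max rescaling are assumptions made for illustration, and each analysis is assumed to supply a score in which a larger value means a more important predictor; how raw outputs such as p-values are turned into such a score is not specified here.

```python
def rescale_0_100(scores):
    """Min-max rescale a {predictor: score} mapping to the range 0 to 100."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: 100.0 * (v - lo) / (hi - lo) for name, v in scores.items()}


def aggregate_importance(*analyses):
    """Average the rescaled scores of each predictor over all analyses."""
    scaled = [rescale_0_100(a) for a in analyses]
    return {
        name: sum(s[name] for s in scaled) / len(scaled)
        for name in scaled[0]
    }


def top_predictors(aggregated, k):
    """Names of the k predictors with the highest aggregated score."""
    return sorted(aggregated, key=aggregated.get, reverse=True)[:k]
```

With four analyses passed to `aggregate_importance`, the result corresponds to the averaged 0-100 rating per predictor, and `top_predictors` then returns the subset used to retrain the model.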

Descriptive analysis
The observed descriptive statistics of SOM and SMC showed a high level of variation in the coefficient of variation, COV (%); the values were between 8.42% and 123.53% for soil properties and image features (Table 2). SOM values ranged between 3.45% and 81.50%, with a mean of 21.05% and a standard deviation of 19.65%. These soil samples are highly variable and were chosen to make sure that the results are versatile. In line with the high values of SOM, the observed SMC values varied between 10.12% and 120.34%, with a mean of 26.07% and a standard deviation of 30.12%. Mean I showed the highest COV, 123.53%. On the other hand, homogeneity was much less variable, with a COV of 8.69%.
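As a worked check, assuming the usual definition of the coefficient of variation as the standard deviation divided by the mean, the reported means and standard deviations give COV values for SOM and SMC that fall inside the stated range of 8.42% to 123.53%:

```latex
\mathrm{COV} = \frac{SD}{\text{mean}} \times 100\%,
\qquad
\mathrm{COV}_{\mathrm{SOM}} = \frac{19.65}{21.05} \times 100\% \approx 93.3\%,
\qquad
\mathrm{COV}_{\mathrm{SMC}} = \frac{30.12}{26.07} \times 100\% \approx 115.5\%
```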

Correlation of SMC and SOM with soil colour
The usefulness of soil colour was evaluated by examining the relationship between digital measurements of soil colour and SOM and SMC. SMC is most strongly correlated with median B (r = -0.75), followed by median V (r = -0.72) and median Y (r = -0.70), whereas median H is only weakly correlated with SMC (r = 0.16). SOM is most strongly correlated with mean H (r = -0.60), followed by energy (r = -0.57) and mean S (r = -0.53). The lowest correlation of SOM (r = 0.09) is with entropy. The correlations were also statistically significant in many cases for both SMC and SOM.
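The r values reported above are correlation coefficients between one image feature and one soil property. Assuming they are Pearson coefficients, which the text does not state explicitly, the computation can be sketched as follows; the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

A negative coefficient, such as the one between median B and SMC, means that the feature value tends to decrease as the soil property increases, which is consistent with wetter soils appearing darker.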

Predictive analysis
Four different study criteria are presented in radial plots (Figures 6-8) to show the relative importance of the predictors in predicting SMC and SOM. In many studies, colour features of soil have played a more significant role in the prediction of SOM and SMC than textural features. Researchers found that the RGB and HSV colour spaces were more important in predicting SMC and SOM than other colour spaces. Moreover, median values taken from the channels of the colour spaces were more significant predictors than the mean values. The significant predictors for SMC are Median R and Median Cb, followed by Median Y, Median Cr, Median V, and Mean G; the least significant variable was Mean S (Figure 7). For SOM, the most significant variable is Mean V, and the least significant variable was Mean S (Figure 8). SMLR models were established with 34 and with 6 features. Each model was first calibrated and then validated against the laboratory-measured SOM and SMC. Table 3 and Table 4 present detailed prediction statistics obtained using 34 and 6 predictor variables.

Prediction of SMC
For predicting SMC, the model was calibrated and validated for all 34 predictor variables, and then, using z-score statistics, six optimal predictor variables were finalized. Based on the z-scores from the radial graph presented in Figure 6(a), Median R was identified as the most important variable in predicting SMC, followed by Median Cb, Median Y, Median Cr, Mean G, and Median V. Finally, the model was calibrated and validated for those six predictor variables. The output metrics for the prediction of SMC were recorded for both internal and external validation. When considering all the features, the R², LCCC, RMSE, RPD, and RPIQ values were 0.62, 0.67, 7.90%, 2.78, and 1.57, respectively, for 10-fold cross-validation, which is the internal validation. For the test dataset, meaning external validation, the R², LCCC, RMSE, RPD, and RPIQ values were 0.67, 0.55, 7.60%, 1.64, and 1.00, respectively. For six predictor variables with internal validation, the R², LCCC, RMSE, RPD, and RPIQ values were 0.67, 0.65, 7.50%, 5.56, and 1.79, respectively. For external validation, the R², LCCC, RMSE, RPD, and RPIQ values were 0.56, 0.56, 6.70%, 1.87, and 1.14, respectively.

Prediction of SOM
The prediction of SOM follows the same steps as the prediction of SMC: the model was calibrated and validated for all 34 predictor variables, and then, using z-score statistics, six optimal predictor variables were finalized. Based on the z-scores from the radial graph presented in Figure 6(b), Mean V was identified as the most important variable in predicting SOM, followed by Median G, Median Q, Mean H, Mean S, and Mean L*. Finally, the model was calibrated and validated for those six predictor variables. The output metrics for the prediction of SOM were recorded for both internal and external validation. When considering all the features, the R², LCCC, RMSE, RPD, and RPIQ values for internal validation were 0.75, 0.73, 7.80%, 2.13, and 1.36, respectively. For the test dataset, meaning external validation, the R², LCCC, RMSE, RPD, and RPIQ values were 0.77, 0.75, 5.5%, 1.88, and 1.07, respectively. For six predictor variables with internal validation, the R², LCCC, RMSE, RPD, and RPIQ values were 0.81, 0.78, 7.3%, 2.11, and 1.57, respectively. For external validation, the R², LCCC, RMSE, RPD, and RPIQ values were 0.92, 0.85, 4.4%, 1.98, and 1.21, respectively.

CONCLUSION
The performance of the SMLR model was assessed for its effectiveness in providing both reasonable and rapid estimations of SOM and SMC using a dataset acquired in a laboratory setting with a smartphone camera. This cost-effective method used a smartphone camera to estimate soil properties from texture and colour features derived from soil images. Experiments involved soil samples from five different crops, reflecting variable organic matter content. Images were captured at various heights to account for continuous changes in both SMC and SOM. Initially, all 34 features were analyzed, followed by a refined selection of 6 optimal features for both SOM and SMC. It was observed that darker soil coloration correlates strongly with higher organic and moisture content. The model underwent both internal and external calibration and validation, with performance metrics recorded for each. While this study includes samples from diverse crop fields exhibiting significant variations in organic matter, future research should explore the impact of SMC variations on soil organic content more thoroughly. Further studies should also investigate the relationship between soil colour and its properties, SMC and SOM, to enhance predictive accuracy. Table 5 shows a comparison of the proposed model with existing models.

Figure 3. Result of threshold segmentation method

Figure 4. Histogram for optimal value of threshold

Table 1. Definition of features
Mean: ... of all pixels in an image
Median: Middle pixel value after all the pixels are sorted in numerical order
Entropy: Statistical measure of randomness
Contrast: Measure of intensity contrast between a pixel and its neighbour over the whole image
Energy: Sum of squared elements in the gray-level co-occurrence matrix (GLCM)
Homogeneity: Closeness of the distribution of elements in the GLCM to the GLCM diagonal

Figure 6. Z-score for features representing the priority towards SMC and SOM prediction

Figure 7. Significance of individual feature as a predictor for SMC prediction corresponding to ANOVA, Cubist, Vtreat, Correlation

Table 2. SOM and SMC descriptive statistics with soil colour

Table 3. Prediction accuracy for SMC using 34 and 6 predictor variables. The results are for 10-fold cross-validation as internal validation (IV) and for external validation (EV)

Table 4. Prediction accuracy for SOM using 34 and 6 predictor variables. The results are for 10-fold cross-validation as internal validation (IV) and for external validation (EV)

Table 5. Comparison of existing literature and proposed model