Soft Sensor Modelling Method Using Improved LWPLS for Fermentation Monitoring of Pichia

In the fermentation process of Pichia pastoris, inherent non-linearity and significant time-variance characteristics are observed, making pivotal state variables challenging to measure online. In this study, a novel online soft sensor modelling approach for the Pichia pastoris fermentation process is introduced. A just-in-time learning (JITL) technique, driven by a multi-similarity measurement coupled with a moving window (MW) strategy, was employed. Historical data were partitioned into real-time multiple samples via the MW technique. Subsequent sub-windows were then filtered, adopting a cumulative similarity strategy. The k-MI algorithm was utilised for the selection of local auxiliary variables within the MW, leading to the construction of the local weighted partial least squares model (LWPLS) via a multi-similarity metric-driven JITL. The fusion of sub-models was accomplished through double weighted ensemble learning. Predictive outcomes indicated superior performance of the proposed soft sensor model in estimating the cell and product concentration of Pichia pastoris in comparison to alternative models.


INTRODUCTION
Pichia pastoris is a significant research target in the field of pharmaceutical proteins.Compared with other existing expression systems, Pichia pastoris has obvious advantages in the processing of expression products, external secretion, posttranslational modification, and glycosylation modification, and has been widely used in the expression of exogenous proteins [1][2][3][4].In order to give full play to Pichia pastoris expression system as an excellent protein expression system, maximize the production efficiency and product quality of Pichia pastoris fermentation to express exogenous proteins, and reduce the production cost [5,6], it is necessary to conduct dynamic regulation and real-time optimization of the fermentation process of Pichia pastoris, so as to accurately control the production under the optimal technological conditions.However, the fermentation process of Pichia pastoris has the characteristics of multi-variable, strong coupling and nonlinear [4].Due to the actual technology and cost, the key parameters that can directly reflect the quality of Pichia pastoris, such as cell concentration, substrate concentration and product concentration, are difficult to be directly measured online.In the absence of cost-effective online detection methods, most use off-line sampling and analysis methods, which makes Pichia pastoris extremely vulnerable to contamination and causes the entire fermentation process to fail.In addition, the off-line laboratory analysis steps are tedious, the data collection interval is long, the lag is large, the real-time performance is poor, these problems become the technical bottleneck of optimization control for the fermentation process of Pichia pastoris, and the soft-sensor technology is an effective way to solve the above problems [7].
Soft sensors enable the prediction of the dominant variable through the auxiliary variable by building a mathematical model between the dominant and auxiliary variables [8].In addition, most traditional soft sensor models use an offline approach to construct mathematical models.For example, Huang et al. [9] proposed the use of least squares support vector machines (LS-SVM) and GPC to construct predictive models and predict output values, and the use of particle swarm optimization (PSO) algorithms to implement rolling optimization.Dave et al. [10] used artificial neural networks with genetic algorithms (ANN-GA) to predict bioethanol production.Yang et al. [11] used nonlinear models (RF, SVR) to predict the fermentation of black tea, which is important to achieve standardization, information, and intelligent processing of black tea fermentation.However, these modeling methods are offline, and as the working environment changes or the fermenting cells become contaminated, the offline models will fail and may cause significant financial losses.Therefore, online learning and real-time modeling techniques are better suited to soft sensor construction.Ren et al. [12] used locally weighted partial least squares (LWPLS) as a modeling method for soft sensors.They used particle swarm PSO to optimize the bandwidth parameters, and the results showed that the PSO-LWPLS model has good predictive performance.Yamada and Kaneko [13] used a Gaussian mixture model to partition the data set, used a genetic algorithm to select explanatory variables, and finally constructed an online nonlinear adaptive soft sensor for each stage of explanatory variables.The results show that the constructed adaptive soft sensor (EGAVDS-LWPLS) can accurately predict the values of the target variables in each process state.
Based on this analysis, in order to make up for the shortcomings of the offline model, this paper adopts the realtime learning strategy.At the same time, considering the computer's limited computing power in practical applications and the lag of the prediction process, a MW approach is used to divide the data subsets to reduce the time to search for samples.Moreover, to ensure the accuracy of the search, the LWPLS model is driven by a multi-similarity measurement approach.Finally, the predictions are fused and output using a double-weighted integrated learning approach.Comparative results show that the Ek-MW-LWPLS model has excellent prediction results.

Soft sensor modelling founded on the principles of JITL
JITL has been recognized as a non-linear local modeling paradigm.Contrary to conventional offline holistic modeling strategies, which typically necessitate a comprehensive global model, JITL is distinct in its approach: data aptly suited for the task is selectively extracted from an extensive database to construct a localized model based on specific query prerequisites.Subsequent to this model formulation, output predictions are generated.As an online real-time strategy, JITL stands in contrast to static approaches.However, it must be noted that the integration of non-linear models with JITL could inadvertently amplify computational demands.Given this, a proclivity towards linear techniques has been observed, with particular emphasis placed on LWPLS [14].Within this context, the LWPLS algorithm is employed to establish a local model, achieved by assessing the similarity between query points and all encompassed database data.
Let it be postulated that X∈R N×M and Y=R N×L represent the input and output matrices of the query samples respectively.Herein, N signifies the volume of associated training samples, while M and L demarcate the dimensionalities of the input and output samples in that order.Upon the arrival of a query sample xq, a pivotal step involves determining the similarity between this sample and historically recorded samples.Consequently, each sample is allocated a weight ω={ω1, ω2, …, ωN}, hinging on the detected similarity, and the weight matrix of analogous samples is represented as Ω=diag{ω}.With these preliminary steps completed, the PLS algorithm is set in motion to undertake the ultimate predictive analysis.The prescribed methodology unfolds as: Step 1: The latent variable count K is designated, with its commencing value set to 1.
Step 2: The weight matrix Ω is computed.
where, 1N×1 embodies a column vector, consistently populated with elements valued at 1.It is pertinent to emphasize that the superscript brackets primarily indicate the vector or matrix pertinent to the kth latent variable [14].
Step 4: An initial estimation of the query output is set as  ̂(0) = .
Step 5: For the kth latent variable, computations for X (k) and the query sample   () are executed as: In this framework, v (k) epitomizes the eigenvector associated with the maximal eigenvalue of the covariance matrix  ()  ()  () .
Step 6: Calculations for the kth load vector p (k) and the regression coefficient vector q (k) are undertaken.
Step 7: The query sample output undergoes an update to  ̂() .
Step 8: A verification is performed to ascertain if the condition k=K is realized.If found to be true, the output derived from Eq. ( 10) is embraced as the conclusive outcome; if not, the process proceeds to Step 9.
Step 9: Both the input matrix X (k+1) and output matrix Y (k+1) of the training data, along with the input vector   (+1) of the query sample, are determined. ( Step 10: The variable k is incremented to k+1, redirecting the procedure back to Step 5.

Enhancement of JITL through ELWPL modelling
Intrinsic to many sample datasets is the existence of diverse similarity relations, leading to the assertion that a singular similarity measure might not adequately reflect genuine data sample relationships.To circumvent these limitations, a fusion of multi-similarity measurements has been suggested as an effective method to discern sample similarities, subsequently improving model accuracy.The culmination of these processes integrates the predictions employing ensemble learning.

Multi-similarity measurement driven modelling
The mechanism underpinning sample similarity determination directly influences the choice of sample subsets.Therefore, to more authentically represent inter-sample relationships, it becomes imperative to judiciously select the similarity measure.An approach is postulated here wherein various similarity measures are employed, resulting in the acquisition of distinct sample subsets.By this strategy, with 'n' different chosen similarity measures, the degree of similarity is denoted by S1, S2, …, Sn.Predictions are then merged, presented as a weighted average.In this context, three computational similarity techniques are integrated, aiming to steer the LWPLS model through a consolidated methodology.
Euclidean Distance Measure: The first similarity measurement, S1, was derived from the Euclidean distance method.

(
) ( ) ( ) In Eq. ( 14),  1, represents the similarity as the Euclidean distance between the query and historical samples.Eq. ( 15) elucidates  1, as the weight correlating each training sample to its similarity measurement S1, with σd being the distance vector's standard deviation and φ the local adjustment parameter.
(1 ) cos ,cos 0 where,  2, and cos(θxi) signify the distance and angle similarities between query and historical samples, respectively. 2, is then discerned for each training sample relative to the similarity measure S2.
PLS-based Latent Similarity Measure: The third similarity measure, S3, builds upon a novel technique proposed by Yuan et al. [18].In this method, the PLS algorithm is engaged with the historical dataset to unearth latent structures, commencing from a low-latitude space.The similarity amongst samples is then gauged within this latent space.
In this formulation, T and U serve as the score matrices for X and Y, respectively, whereas P and Q denote the load matrices for X and Y. E and F stand as residual matrices for X and Y.The latent score for the ith historical sample is represented by ti.Additionally, the latent variable row vector of the query sample is denoted as tq [18].The distance between the ith sample and query sample, considering supervised latent variables, is encapsulated in Eq. ( 18), with  3, describing the weight for each training sample in relation to similarity measure  2 .
( ) ( ) It's worth noting that each similarity measure uniquely defines a sample set, each harboring inherent strengths and shortcomings.Hence, the amalgamation of multiple similarity measures can potentially bolster model generalization and predictive prowess.

Weight assignment in multi-similarity metric-driven models
Given that the Kth similarity measure of a query sample is symbolized as SK, its prediction result is encapsulated as  ̂, .For this research,   is defined as the ensemble weight linked to the prediction outcomes from the similarity SK.The eventual composite output is delineated in Eq. (21).
In this equation,   can be deduced from the prediction outcomes of the cross-validation dataset, with the projected result represented as  ̂ℎ, , where h denotes the count of submodels derived from the cross-validation dataset.The mean squared error (RMSE) corresponding to each similarity measure   is laid out in Eq. ( 22).Following this, the weight   tied to the prediction outcomes for each similarity measure   can be discerned as shown in Eq. ( 23).Conclusively, local online models crafted via these three similarity measures are weighted for output, with the procedural flow chart of the enhanced model depicted in Figure 1.

Dynamic partitioning of datasets with moving windows
The imperative nature of selecting apt sample sets in immediate learning strategies is well-recognized.However, extracting sample sets from exhaustive historical data imposes an excessive computational load.In the study of the fermentation process of Saccharomyces cerevisiae, a continuously fluctuating biochemical reaction is observed.Due to the similarity in data attributes from consecutive time intervals, an MW approach is advocated for the preliminary selection of analogous samples.
MW, characterized as an adaptive modeling technique, operates on the basis of sampling time.Historical samples are systematically divided into subsequent moving window modules, ensuring the presence of data with congruent attributes within individual windows.With the progression of time, updates to the dataset are made by discarding aged samples and introducing newer ones (as illustrated in Figure 2).
As demonstrated in Figure 2, let the window length be denoted by L. Upon the arrival of a query sample, it is positioned at the extreme right of the window.The first window's data matrix is represented as {  1 ,   1 }.With each successive movement, the window shifts one unit rightward along the time axis, marking the input and output datasets as {  2 , ,   2 }.This action is reiterated until the query sample aligns with the window's leftmost end.
A pivotal aspect of the moving window technique is the determination of an optimal window length, L. Overly extensive windows risk encompassing redundant samples, thereby amplifying computational complexity.On the contrary, extremely narrow windows may not aptly represent process variability.Sadly, no universal method exists for ascertaining the perfect window size, making repetitive experimentation a standard approach to derive a suitable value.
Simulations revealed a notable observation: minor movements of the window often led to significant similarities in data across neighboring windows.Such a scenario precipitates the construction of sub-models with excessive resemblances, inadvertently increasing computational intricacies and potentially impairing predictive accuracy.To maintain diversity within sub-datasets, the concept of cumulative similarity threshold was integrated, as delineated in Eq. ( 24) [19].
where,   epitomizes the proportion of the cumulative similarity of group k+1's sub-dataset to that of group k.L signifies the count of sampling points encompassed within the window.If   ∈ [1 − , 1 + ] , subset k+1 is negated and exempted from modeling.Conversely, a value indicates the prevalent diversity within the windowed sub-dataset.Cumulative similarity,   , was ascertained utilizing the crossvalidation method, represented by ε=0.5.Significantly, the algorithm's foundation, represented by the moving window, only employs the similarity measure  1 as the exclusive screening criterion for moving window selection in this research.

Ensemble learning application to moving windows
For enhanced prediction accuracy, an ensemble learning strategy was incorporated to amalgamate the results from each window's data output.This fusion process is achieved using a weighted average method, expressed in Eq. (25).

( ) ( )
In the above equation,   stands for the weightage of the base learner ℎ  () within the ith moving window, and T represents the quantity of moving windows post-filtration.Weight calculations were based on the moving window's screening criteria mentioned earlier.By leveraging the weighted averaging algorithm, improvements in learner generalization and predictive precision were observed.A comprehensive flow chart detailing the construction of a moving window model via ensemble learning is provided in Figure 3.

Auxiliary variable selection through k-nearest neighbor mutual information
The fermentation process of Pichia pastoris is delineated as a multi-phasic operation.Within its distinct stages, auxiliary variables exert varied influences upon the principal parameters.In specific fermentation phases, certain environmental variables are necessitated to be meticulously adjusted and sustained at designated values.Consequently, these variables remain unamenable to localized modeling.A challenge arises when JITL indiscriminately employs auxiliary variables for both local and global modeling, inadvertently compromising model precision.
In light of the aforementioned concerns, a strategy for preselecting auxiliary variables for sample data, defined by the moving window's range, was devised.It should be underscored that this approach abstains from filtering the auxiliary variables for each data subset derived from the moving window.As the query sample is presented, the auxiliary variables spanning the entire sample data traversed by the moving window are mandated to undergo filtering through the application of the k-MI algorithm [20].Shannon's [21] proposition of mutual information serves as a metric quantifying the interrelation between variables.Mutual information is emblematic of the shared data between the random variables x and y.A pronounced correlation between two random variables is directly proportional to an augmented mutual information.The computation of mutual information is depicted in Eq. ( 26).In the equation, p(x) and p(y) represent the marginal probability distributions of the random variables x and y, respectively, whilst p(x, y) symbolizes the joint probability distribution intertwining the two random variables.Acquiring the probability distributions amidst the auxiliary variables in Pichia pastoris's fermentation process is recognized as an intricate task.Employing the k-MI algorithm has been observed to efficiently mitigate the computational strain.Furthermore, the algorithm's capability to curtail the count of auxiliary variables markedly enhances the model's reaction velocity.The k-MI algorithm's computation is elucidated in Eq. ( 27).

MODELING STRATEGY OF THE SOFT SENSOR (EK-MW-LWPLS)
The EK-MW-LWPLS algorithm's modeling flow is graphically presented in Figure 4.
For a lucid understanding of the soft sensor model's construction, the modeling strategy is elucidated through the subsequent sequence: Step 1: Data spanning multiple batches are amassed, consequent to which a comprehensive database is established.
Thereafter, an ordering of the data based on the temporal aspect of sampling is executed.
Step 2: The length of the moving window is designated as L, leading to the calculation of the cumulative span across which the moving window is to traverse.Within the circumscribed local range, the k-MI algorithm is deployed for the discerning selection of auxiliary variables.
Step 3: A systematic sliding of the moving window, commencing from the leftmost bound and concluding when the query sample aligns with the window's left extremity, culminates in the generation of N distinct sub-datasets.
Step 4: Three distinct methodologies for gauging similarity are explored, of which the similarity measure, S1, is employed for the computation of the cumulative similarity threshold.This step results in the elimination of superfluous moving windows.
Step 5: A disaggregated approach is adopted; wherein disparate similarity measures undergo individual training sessions tailored to the sub-models under the purview of the MW-LWPLS algorithm.
Step 6: Variegated moving windows facilitate the creation of sub-models.Subsequently, predictions derived from these are amalgamated via a weighted average, culminating in the final output.

Procedural overview of Pichia pastoris fermentation
Employing the fermentation process of Pichia pastori as the focal point, the strain Pichia pastoris GS115, MutS His+ was chosen for the experimental trials.The pilot platform, furnished by Yangzhong Jiaocheng Biotechnology Research Co., Ltd, served as the venue for the completion of the fermentation process, utilizing the A103-500L model of the fermentation tank.A visual representation of the Pichia pastoris fermentation process can be observed in Figure 5. • Prior to the inoculation of the strain into the vessel, preparatory steps encompassing culture medium preparation, flask shaking, and sterilization of the fermentation apparatus were diligently executed in line with procedural guidelines.Subsequent to these, the medium was subjected to sterilization at a temperature of 130℃ for a span of 30 minutes.Upon cooling to 30℃, flame inoculation was employed to introduce the strain.
• Data variables, garnered through sampling at 15-minute intervals, were meticulously archived in a structured database and arranged chronologically.Both cell and product concentrations were sampled at bi-hourly intervals.
• In real-world scenarios, it was observed that the number of environmental variables recorded by sensor instruments considerably surpassed the quantity of offline samples.To align the measurement timelines of cell and product concentration with the environmental variables' sampling time, interpolation was harnessed to compensate for any data voids.Initially, ten batches of interpolated data, amounting to 10,800 groupings, were chronologically ordered.This was followed by noise filtration from the dataset, preserving 10,325 groupings as the final dataset.For the model's purposes, nine data sets were designated for training, while a single set was earmarked for testing.

Determination of auxiliary variables
Upon examination of the Pichia pastoris fermentation mechanism, both cell concentration X and product concentration S were identified as primary variables in the process.In this context, 16 environmental variables, which can be directly measured, were elected as auxiliary variables, delineated in Table 1.Utilizing the k-MI algorithm, a screening of auxiliary variables was conducted.Subsequently, mutual information between the environmental variables and cell concentration was calculated and sequenced based on magnitude, as depicted in Figure 6. Figure 6(a) presents the mutual information between environmental variables and cell concentrations, corresponding to query variables across three distinct stages.Meanwhile, Figure 6(b) portrays the mutual information between the environmental variables and cell concentration related to a specific stage's query variables, methodically sequenced by magnitude.In this study, the six paramount environmental variables, in terms of correlation, were chosen to formulate the soft sensor predictive models.A given stage's query variable aligns with a cell concentration's predictive model, exemplified in Eq. ( 28).

Ascertainment of moving window length
For optimizing computational efforts and enhancing modeling efficiency, it is imperative that the length of moving windows remains minimal while safeguarding model accuracy.Given the existence of nine data batches within the database, it was observed that excessively short or long moving window lengths directly influenced predictive outcomes.Under the stipulated condition L∈ [20,140], the MW-LWPLS model's predictive accuracy was evaluated at intervals of 20 sampling points.The corresponding predictive results are presented in Figure 7.As shown in Figure 7, when L=40 and L=140, the actual and predicted values of cell concentration are significantly shifted, and dispersion was large.When L=60 and L=120, the prediction of pre-fermentation and smooth period was more satisfactory, but the error of the exponential growth period of fermentation was larger.If there is a large error in prediction during the exponential growth period of fermentation, it will lead to an inability to replenish the raw material in time for the fermentation process, thus making the fermentation less effective than expected and affecting the yield.When L=80 and L=100, the dispersion of the prediction results becomes significantly smaller.To further determine the length of the moving window L, the root means square error (RMSE) and the coefficient of determination (R 2 ) are used as criteria for judging the model, as shown in Table 2.
Table 2 shows that RMSE decreases and R 2 increases as the length of L increases from 40 to 80.As the L length increases from 80 to 140, the RMSE increases, and the R 2 decreases.Therefore, the optimal window length of around 80 can be determined and applied to the subsequent soft sensor model.

Simulation results and analysis of the Ek-MW-MSLWPLS model
Upon the ascertainment of the optimal window length, the LWPLS model, MW-LWPLS model, Ek-MW-LWPLS model, and Ek-MW-MSLWPLS model were constructed to assess the relative efficacy of the Ek-MW-MSLWPLS algorithm.Predictive results for cell concentration across these models are illustrated in Figure 8.To facilitate a comprehensive comparison across the four models, evaluative metrics, namely the root mean square error (RMSE), coefficient of determination (R 2 ), mean relative error (MRE), and maximum absolute error (MAX), were employed.Table 3 presents these results, signifying that the Ek-MW-MSLWPLS model's RMSE, MRE, and MAX were markedly inferior to those of the other models, while its R 2 approached unity, indicating a superior performance of the Ek-MW-MSLWPLS soft sensor model.For a more tangible comparison of the predictive capabilities of the four models, predictive residuals for both bacterial and product concentrations are showcased in Figure 10.A discernible superiority of the Ek-MW-MSLWPLS soft sensor model over its counterparts is evident.

CONCLUSIONS
The fermentation process involving Pichia pastoris has been characterized by its temporal variability, inherent delay, and non-linearity.In situations where environmental conditions undergo alterations, offline models have been observed to falter.To address these challenges, the Ek-MW-MSLWPLS soft sensor model was introduced, offering an online solution.Data subsets were delineated through moving windows, and, within these data subsets, explanatory variables underwent filtration via the k-MI algorithm.Local modeling was driven by a multi-similarity measurement, promoting timely learning.As a culminating step, dual ensemble learning facilitated the amalgamation of all models present in the sub-datasets.Based on the experimental outcomes, it was deduced that the Ek-MW-MSLWPLS model possesses the capability to predict cell and product concentrations online with precision.

Figure 2 .Figure 3 .
Figure 2. Illustration of dataset division through moving windows

Figure 5 .
Figure 5. Schematic of the Pichia pastoris fermentation process

Figure 6 .
Figure 6.Mutual information association between environmental variables and cell concentrations within moving windows across diverse query variables, and arrangement of mutual information between environmental variables and cell concentrations for singular query variable

Figure 8 .
Figure 8. Depiction of cell concentration predictive curves across four model types In Figure 8(a), notable deviations between the predictive cell concentration curve and the actual curve were observed.Figure 8(b) highlights how, upon incorporating a moving window for cell concentration predictions, a reduction in overall bias was achieved, though certain phases still manifested pronounced errors.Figure 8(c) exhibits a method where both local auxiliary variable selection and an integrated algorithm for moving window optimization were integrated into the model from Figure 8(b).This integration led to marked enhancements in predictive accuracy, albeit with heightened fluctuations and dispersed individual predictive points.The implementation of a multi-similarity measurement-driven model combined with an integrated output approach in Figure 8(d) revealed superior predictive accuracy, with curves

Figure 8 (
b) highlights how, upon incorporating a moving window for cell concentration predictions, a reduction in overall bias was achieved, though certain phases still manifested pronounced errors.

Figure 8 (
c) exhibits a method where both local auxiliary variable selection and an integrated algorithm for moving window optimization were integrated into the model from Figure 8(b).This integration led to marked enhancements in predictive accuracy, albeit with heightened fluctuations and dispersed individual predictive points.The implementation of a multi-similarity measurement-driven model combined with an integrated output approach in Figure 8(d) revealed superior predictive accuracy, with curves aligning more closely.Further validation of the algorithm's feasibility entailed employing the model to predict the product concentration of Pichia pastoris, illustrated in Figure 9.The integration of the moving window ostensibly refined the sample search scope, subsequently bolstering the efficacy of the local model compared to a broader search.A comparison between Figures 9(a) and 9(b) highlighted the moving window's instrumental role in bolstering the model's curve tracking capacity.Yet, a sharp alteration in the actual curve revealed a deficiency in the model's tracking proficiency, as evidenced in Figure 9(c).Nevertheless, Figure 9(d) demonstrates that the introduction of multiple similarity metrics significantly mitigated this issue.

Figure 9 .Figure 10 .
Figure 9. Four model predictive curves for cell concentration

Table 3 .
Comparative analysis of cell and product concentration predictive errors across models