Improving the Speed of Support Vector Regression Using Regularized Least Square Regression


Sanaz Sagha Pirmard, Yahya Forghani*

Computer Department, Imam Reza International University, Mashhad 9138833186, Iran

Computer Department, Islamic Azad University, Mashhad Branch, Mashhad 9187147578, Iran

Corresponding Author Email: yforghani@mshdiau.ac.ir

Page: 427-435 | DOI: https://doi.org/10.18280/isi.250404

Received: 20 May 2020 | Revised: 15 July 2020 | Accepted: 26 July 2020 | Available online: 20 September 2020

© 2020 IIETA. This article is published by IIETA and is licensed under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).


Abstract: 

The Regularized Least Square (RLS) method is one of the fastest function estimation methods, but it is too sensitive to noise. In contrast, ε-insensitive Support Vector Regression (ε-SVR) is robust to noise but does not have a good runtime. ε-SVR supposes that the noise level is at most ε; therefore, the center of a tube with radius ε, which is used as the estimated function, is determined in such a way that the training data are located in that tube, which makes the method robust to such noisy data. In this paper, to improve the runtime of ε-SVR, an initial estimated function is first obtained using the RLS method. Then, unlike the ε-SVR model, which uses all the data to determine the lower and upper limits of the tube, our proposed method uses the initial estimated function to determine the tube and the final estimated function. Strictly speaking, the data below and above the initial estimated function are used to estimate the upper and lower limits of the tube, respectively. Thus, the number of model constraints and, consequently, the model runtime are reduced. Experiments carried out on 15 benchmark data sets confirm that our proposed method is faster than ε-SVR, ε-TSVR and pair v-SVR, and that its accuracy is comparable with that of ε-SVR, ε-TSVR and pair v-SVR.

Keywords: 

ε-insensitive support vector regression (ε-SVR), regularized least square (RLS), runtime, function estimation

1. Introduction

One of the most commonly used eager learning methods is the least-squares (LS) method [1], which was first proposed by Carl Friedrich Gauss. In the LS method, the best estimated function is the one that minimizes the sum of squared differences between the observed values and the corresponding values produced by the estimated function. Later, on the basis of the LS method, the Regularized Least Square (RLS) method was proposed [2, 3]. RLS uses a regularization term to reduce the complexity of the estimated function and to increase its generalization ability.

ε-insensitive Support Vector Regression (ε-SVR) [4] supposes that the noise in the output variable is at most ε. Therefore, the center of a tube with radius ε, which is used as the estimated function, is determined in such a way that the training data are located in that tube; hence, this method is robust to such noisy data. In v-SVR [5], the maximum noise level, i.e. the tube radius, is considered as a model variable, which is determined in the process of solving the model. In both ε-SVR and v-SVR, the maximum noise level, or the radius of the tube, is assumed to be constant, i.e. independent of the characteristics of the training data. To alleviate this shortcoming, Support Vector Interval Regression Networks (SVIRNs) were proposed [6]. In SVIRNs, the upper and lower bounds of the tube are determined automatically by two independent RBF networks, so the radius of the tube can change with the features of the training data. Therefore, SVIRNs are robust to noise whose maximum level depends on the training data features. These two networks are initialized using ε-SVR. v-Support Vector Interval Regression Networks (v-SVIRN) [7] determine the upper and lower bounds of the tube with only one neural network, i.e. one model. In Twin SVR (TSVR) [8], the estimated function is determined by solving two small Quadratic Programming Problems (QPPs), whereas ε-SVR solves one large QPP; therefore, solving the TSVR model needs less time than solving the ε-SVR model. Later, in ε-TSVR [9], a regularization term was used to reduce the complexity of the estimated function and to increase its generalization ability. The regularization term of ε-TSVR is the L2-norm of the estimated function weight vector, while the L1-TWSVR model [10] uses the L1-norm of the weight vector as the regularization term. L1-TWSVR is a linear program, while ε-TSVR is a QPP. Moreover, the estimated function of L1-TWSVR is sparser than that of ε-TSVR; therefore, the testing time of L1-TWSVR is less than that of ε-TSVR. Pair v-SVR [11] is an extended version of ε-TSVR which has the additional advantage of using the parameter v to control the bounds on the fraction of support vectors and errors.

All the mentioned SVR-based methods assume that the noise distribution is symmetrical; therefore, the center of a tube which contains the noisy data is used as the estimated function. In other words, the estimated function is determined in such a way that approximately half of the training data are positioned above it and the remaining data are positioned below it. In the asymmetric v-SVR method [12], the estimated function is determined in such a way that an arbitrary number of training data are positioned above it and the remaining data are positioned below it. The size of each of these two parts of the data is determined based on prior knowledge about the noise distribution. Later, a twin version of asymmetric v-SVR, i.e. asymmetric v-TSVR [13], was proposed to improve the runtime of the asymmetric v-SVR model.

Among the mentioned methods, the LS and RLS methods have the lowest runtime because each of them is an n + 1 variable problem with a closed-form solution obtained by inverting an (n + 1) × (n + 1) matrix, but they are too sensitive to noisy data. In contrast, SVR-based models have good robustness against noise but do not have a good runtime because, for example, solving the ε-SVR problem requires solving a quadratic programming problem with 2×n variables and linear constraints, which is too time consuming. In this paper, to improve the runtime of ε-SVR, an initial estimated function is first obtained using the RLS method. Then, unlike the ε-SVR model, which uses all training data to determine the lower and upper limits of the tube, our proposed method uses the initial estimated function to determine the tube and the final estimated function. In other words, the data below and above the initial estimated function are used to estimate the upper limit and the lower limit of the tube, respectively. Thus, the number of model constraints and, consequently, the model runtime are reduced. The experiments carried out on 15 benchmark data sets confirm that our proposed method is faster than ε-SVR, ε-TSVR and pair v-SVR, and that its accuracy is comparable with that of ε-SVR, ε-TSVR and pair v-SVR.

The rest of this paper is organized as follows. In Section 2, RLS and ε-SVR are introduced. Then, in Section 3, our proposed method is presented. In Section 4, the results of the experiments are presented, and Section 5 provides the conclusion.

2. Background

2.1 RLS

Let {(x1,y1), (x2,y2), …, (xn,yn)} be the training data, where $\mathrm{x} \in \mathrm{R}^{\mathrm{d}}$ is the input variable and $\mathrm{y} \in \mathrm{R}$ is the output variable. Suppose $\varphi(.)$ is a function which transforms data from the input space into a high-dimensional feature space. The goal is to find the function f(x)=wTφ(x)+b with the weight vector $\mathrm{w}=\sum_{i=1}^{n} \alpha_{i} \varphi\left(x_{i}\right)$ and the bias b such that

$\forall i: y_{i} \cong f\left(x_{i}\right)$     (1)

For this purpose, the following model is solved:

$\min _{\alpha} \frac{1}{2}\left(\|\alpha\|^{2}+\mathrm{b}^{2}\right)+\frac{\mathrm{C}_{1}}{2} \sum_{\mathrm{i}=1}^{\mathrm{n}}\left(\mathrm{f}\left(\mathrm{x}_{\mathrm{i}}\right)-\mathrm{y}_{\mathrm{i}}\right)^{2}$     (2)

where, α=(α1, α2, …, αn)T. The first term of model (2) is intended to prevent over-fitting and to increase generalizability. The parameter C1 controls the trade-off between the training error of the estimated function and its generalization ability. Model (2) can be written as follows:

$\min _{\alpha} \frac{1}{2}\left(\|\alpha\|^{2}+\mathrm{b}^{2}\right)+\frac{\mathrm{C}_{1}}{2}\|\mathrm{K} \alpha+\mathrm{b}-\mathrm{y}\|^{2}$     (3)

where, y=(y1, y2, …, yn)T and K is a matrix whose entry in the i-th row and j-th column, i.e. Kij, is equal to k(xj,xi)=φT(xj)φ(xi), where k(.,.) is a kernel function. Model (3) can be written as follows:

$\min _{\widetilde{\alpha}} L=\frac{1}{2}\|\widetilde{\alpha}\|^{2}+\frac{C_{1}}{2}\|\widetilde{\mathrm{K}} \widetilde{\alpha}-\mathrm{y}\|^{2}$     (4)

where,

$\widetilde{\alpha}=\left(\begin{array}{c}\alpha \\ \mathrm{b}\end{array}\right)$     (5)

$\widetilde{\mathrm{K}}=(\mathrm{K} \quad 1)$     (6)

At the optimal point of the model (4), we have:

$0=\frac{\partial \mathrm{L}}{\partial \widetilde{\alpha}}=\widetilde{\alpha}+\mathrm{C}_{1} \widetilde{\mathrm{K}}^{\mathrm{T}} \widetilde{\mathrm{K}} \widetilde{\alpha}-\mathrm{C}_{1} \widetilde{\mathrm{K}}^{\mathrm{T}} \mathrm{y}$     (7)

Therefore,

$\widetilde{\alpha}=C_{1}\left(\mathrm{I}+\mathrm{C}_{1} \widetilde{\mathrm{K}}^{\mathrm{T}} \widetilde{\mathrm{K}}\right)^{-1} \widetilde{\mathrm{K}}^{\mathrm{T}} \mathrm{y}$     (8)

RLS is sensitive to noise because when yi is noisy, the optimal function f(.) changes mistakenly so that the objective function of RLS in Eq. (2) is minimized.
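As a concrete illustration of the closed-form solution (8), the following Python sketch computes the RLS coefficients for a Gaussian kernel. This is a minimal sketch, not the authors' original MATLAB implementation: the function names, the data, and the values of σ and C1 are illustrative assumptions, and a linear solve is used instead of an explicit matrix inverse.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    # Pairwise Gaussian kernel matrix: K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2)).
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-d2 / (2.0 * sigma**2))

def rls_fit(X, y, sigma=1.0, C1=10.0):
    # Closed-form RLS solution of Eq. (8): alpha_tilde = C1 (I + C1 K~^T K~)^{-1} K~^T y.
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    K_tilde = np.hstack([K, np.ones((n, 1))])              # Eq. (6): K~ = (K  1)
    A = np.eye(n + 1) + C1 * K_tilde.T @ K_tilde
    alpha_tilde = C1 * np.linalg.solve(A, K_tilde.T @ y)   # linear solve instead of inverting A
    return alpha_tilde[:n], alpha_tilde[n]                 # (alpha, b), as stacked in Eq. (5)

def rls_predict(X_train, alpha, b, X_new, sigma=1.0):
    # f(x) = sum_i alpha_i k(x_i, x) + b
    return gaussian_kernel(X_new, X_train, sigma) @ alpha + b
```

A usage such as `alpha, b = rls_fit(X, y)` followed by `rls_predict(X, alpha, b, X_new)` involves only one (n + 1) × (n + 1) solve, which is what makes RLS fast but also leaves it fully exposed to noise in y.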

2.2 ε-SVR method

ε-SVR supposes that the noise level is at most ε. Therefore, the center of a tube with radius ε, which is used as the estimated function, is determined in a way that the training data are located in that tube. In other words, the function f(x)=wTφ(x)+b is determined in such a way that as much as possible,

$\forall i:-\varepsilon \leq y_{i}-f\left(x_{i}\right) \leq \varepsilon$     (9)

To do this, the following model is solved:

$\min _{\mathrm{w}, \mathrm{b}, \xi, \hat{\xi}} \frac{1}{2}\|\mathrm{w}\|^{2}+\mathrm{C} \sum_{\mathrm{i}=1}^{\mathrm{n}}\left(\xi_{\mathrm{i}}+\hat{\xi}_{\mathrm{i}}\right)$

subject to $\left\{\begin{array}{l}\mathrm{y}_{\mathrm{i}}-\mathrm{f}\left(\mathrm{x}_{\mathrm{i}}\right) \leq \varepsilon+\xi_{i}, \quad i=1,2, \ldots, n \\ \mathrm{f}\left(\mathrm{x}_{\mathrm{i}}\right)-\mathrm{y}_{\mathrm{i}} \leq \varepsilon+\hat{\xi}_{i}, \quad i=1,2, \ldots, n \\ \xi_{\mathrm{i}}, \hat{\xi}_{\mathrm{i}} \geq 0, \quad i=1,2, \ldots, n\end{array}\right.$     (10)

where, ξ and $\hat{\xi}$ are slack variables that allow some outliers to be positioned outside the tube (see Figure 1). The number of outliers positioned outside the tube is controlled by the parameter C≥0.

Figure 1. The estimated function of ε-SVR for the training data represented by x signs

The dual of the model (10) is as follows:

$\max _{\alpha, \hat{\alpha}} \sum_{i=1}^{n}\left(\alpha_{i}-\hat{\alpha}_{i}\right) y_{i}-\varepsilon \sum_{i=1}^{n}\left(\hat{\alpha}_{i}+\alpha_{i}\right)$$-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}\left(\alpha_{i}-\widehat{\alpha}_{i}\right)\left(\alpha_{j}-\widehat{\alpha}_{j}\right) k\left(x_{i}, x_{j}\right)$

subject to $\left\{\begin{array}{l}\sum_{i=1}^{n}\left(\alpha_{i}-\hat{\alpha}_{i}\right)=0 \\ 0 \leq \alpha_{i} \leq C, \quad i=1, \ldots, n \\ 0 \leq \hat{\alpha}_{i} \leq C, \quad i=1, \ldots, n\end{array}\right.$     (11)

According to the Karush-Kuhn-Tucker (KKT) conditions [14], we have:

$\mathrm{w}=\sum_{i=1}^{\mathrm{n}}\left(\alpha_{\mathrm{i}}-\widehat{\alpha}_{\mathrm{i}}\right) \varphi\left(\mathrm{x}_{\mathrm{i}}\right)$     (12)

Therefore, the estimated function is determined according to the following formula:

$\mathrm{f}(\mathrm{x})=\mathrm{w}^{\mathrm{T}} \varphi(\mathrm{x})+\mathrm{b}=\sum_{\mathrm{i}=1}^{\mathrm{n}}\left(\alpha_{\mathrm{i}}-\widehat{\alpha}_{\mathrm{i}}\right) \mathrm{k}\left(\mathrm{x}_{\mathrm{i}}, \mathrm{x}\right)+\mathrm{b}$     (13)

where, based on the KKT conditions, the bias is determined by the following formula:

$\mathrm{b}=\left\{\begin{array}{ll}\mathrm{y}_{\mathrm{i}}-\sum_{\mathrm{j}=1}^{\mathrm{n}}\left(\alpha_{\mathrm{j}}-\widehat{\alpha}_{\mathrm{j}}\right) \mathrm{k}\left(\mathrm{x}_{\mathrm{j}}, \mathrm{x}_{\mathrm{i}}\right)-\varepsilon, & 0<\alpha_{\mathrm{i}}<\mathrm{C} \\ \mathrm{y}_{\mathrm{i}}-\sum_{\mathrm{j}=1}^{\mathrm{n}}\left(\alpha_{\mathrm{j}}-\widehat{\alpha}_{\mathrm{j}}\right) \mathrm{k}\left(\mathrm{x}_{\mathrm{j}}, \mathrm{x}_{\mathrm{i}}\right)+\varepsilon, & 0<\widehat{\alpha}_{\mathrm{i}}<\mathrm{C}\end{array}\right.$     (14)
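For later comparison with the proposed method, the following sketch assembles and solves the dual problem (11) with a generic convex QP solver (cvxopt) and recovers the bias via Eq. (14). It reuses gaussian_kernel() from the RLS sketch above; the choice of cvxopt, the small ridge added for numerical stability, the tolerance, and the hyper-parameter values are assumptions of this sketch, not part of the paper (the authors report using MATLAB's quadprog).

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False  # silence cvxopt's iteration log

def _cvx(a):
    # cvxopt expects dense 'd' matrices; reshape 1-D arrays into column vectors.
    a = np.asarray(a, dtype=float)
    return matrix(a if a.ndim == 2 else a.reshape(-1, 1))

def eps_svr_fit(X, y, sigma=1.0, C=10.0, eps=0.1):
    # Solve the dual (11) over the stacked variable z = [alpha; alpha_hat] (2n variables).
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    P = np.block([[K, -K], [-K, K]]) + 1e-8 * np.eye(2 * n)   # small ridge keeps P numerically PSD
    q = np.concatenate([eps - y, eps + y])                    # minimize 1/2 z^T P z + q^T z
    G = np.vstack([-np.eye(2 * n), np.eye(2 * n)])            # box constraints 0 <= z <= C
    h = np.concatenate([np.zeros(2 * n), C * np.ones(2 * n)])
    A = np.concatenate([np.ones(n), -np.ones(n)])[None, :]    # sum_i (alpha_i - alpha_hat_i) = 0
    sol = solvers.qp(_cvx(P), _cvx(q), _cvx(G), _cvx(h), _cvx(A), _cvx([0.0]))
    z = np.array(sol["x"]).ravel()
    alpha, alpha_hat = z[:n], z[n:]
    coef = alpha - alpha_hat                                  # expansion coefficients of Eq. (13)
    tol = 1e-6
    free = np.where((alpha > tol) & (alpha < C - tol))[0]     # unbounded alpha_i: first case of Eq. (14)
    if free.size > 0:
        i = free[0]
        b = y[i] - K[i] @ coef - eps
    else:                                                     # otherwise an unbounded alpha_hat_i
        i = np.where((alpha_hat > tol) & (alpha_hat < C - tol))[0][0]
        b = y[i] - K[i] @ coef + eps
    return coef, b

def eps_svr_predict(X_train, coef, b, X_new, sigma=1.0):
    # Eq. (13): f(x) = sum_i (alpha_i - alpha_hat_i) k(x_i, x) + b
    return gaussian_kernel(X_new, X_train, sigma) @ coef + b
```

Note that the QP above has 2×n variables, which is the source of the runtime cost addressed by the proposed method in the next section.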

3. Our Proposed Method

The RLS method has a low runtime, but it is sensitive to noisy data. In contrast, ε-SVR-based models are robust to noise but do not have a good runtime. In this paper, to improve the runtime of ε-SVR, an initial estimated function is first obtained using the RLS method. Then, unlike the ε-SVR model, which uses all training data to determine the lower and upper limits of the tube, our proposed method uses the initial estimated function to determine the tube and the final estimated function. In other words, the data below and above the initial estimated function are used to estimate the upper and lower limits of the tube, respectively. Thus, the number of model constraints and, consequently, the model runtime are reduced.

Let $\mathrm{f}(\mathrm{x})=\sum_{\mathrm{i}=1}^{\mathrm{n}} \alpha_{\mathrm{i}} \mathrm{k}\left(\mathrm{x}_{\mathrm{i}}, \mathrm{x}\right)+\mathrm{b}$ be the initial estimated function obtained by applying the RLS model to the training data {(x1,y1), (x2,y2), …, (xn,yn)}. If $y_i - f(x_i) \leq 0$, then (xi,yi) is a point below the function f(.); otherwise, (xi,yi) is a point above the function f(.). Let Ia be the set of indices of the training data above the function f, and Ib be the set of indices of the training data below the function f, so that |Ia|+|Ib|=n. Our proposed model for determining a tube with radius ε and center $\tilde{f}(x)=w^{T} \varphi(x)+b$ is as follows:

$\operatorname{Min}_{\mathbf{w}, \mathbf{b}, \bar{\xi}} \frac{1}{2}\|\mathbf{w}\|^{2}+\mathrm{C}_{2} \sum_{i=1}^{\mathrm{n}} \xi_{\mathrm{i}}$

subject to $\left\{\begin{array}{l}\mathrm{y}_{\mathrm{i}}-\tilde{\mathrm{f}}\left(\mathrm{x}_{\mathrm{i}}\right) \leq \varepsilon+\xi_{\mathrm{i}}, \mathrm{i} \in \mathrm{I}_{\mathrm{a}} \\ \tilde{\mathrm{f}}\left(\mathrm{x}_{\mathrm{i}}\right)-\mathrm{y}_{\mathrm{i}} \leq \varepsilon+\xi_{\mathrm{i}}, \mathrm{i} \in \mathrm{I}_{\mathrm{b}} \\ \xi_{\mathrm{i}} \geq 0, \quad \mathrm{i}=1,2, \ldots, \mathrm{n}\end{array}\right.$     (15)

where, ξ is a slack variable that allows some outliers to be positioned outside the tube. The number of outliers positioned outside the tube is controlled by the parameter C2. Unlike ε-SVR, which uses all training data to determine the lower and upper bounds of the tube, our proposed model uses only the training data above the initial estimated function to determine the upper limit of the tube, and only the training data below it to determine the lower limit of the tube. These two subsets of the training data are determined based on the initial estimated function obtained by RLS. The Lagrange function of the model (15) is as follows:

$\mathrm{L}=\frac{1}{2}\|\mathrm{w}\|^{2}+\mathrm{C}_{2} \sum_{\mathrm{i}=1}^{\mathrm{n}} \xi_{\mathrm{i}}$

$-\sum_{\mathrm{i} \in \mathrm{I}_{\mathrm{a}}} \alpha_{\mathrm{i}}\left(\varepsilon+\xi_{\mathrm{i}}-\mathrm{y}_{\mathrm{i}}+\mathrm{w}^{\mathrm{T}} \varphi\left(\mathrm{x}_{\mathrm{i}}\right)+\mathrm{b}\right)$

$-\sum_{\mathrm{i} \in \mathrm{I}_{\mathrm{b}}} \alpha_{\mathrm{i}}\left(\varepsilon+\xi_{\mathrm{i}}+y_{\mathrm{i}}-\mathrm{w}^{\mathrm{T}} \varphi\left(\mathrm{x}_{\mathrm{i}}\right)-\mathrm{b}\right)-\sum_{\mathrm{i}=1}^{\mathrm{n}} \gamma_{\mathrm{i}} \xi_{\mathrm{i}}$     (16)

where, γi and αi are Lagrange coefficients. At the optimal point of the model (15) we have:

$\frac{\partial \mathrm{L}}{\partial \mathrm{w}}=0 \rightarrow \mathrm{w}=\sum_{\mathrm{i} \in \mathrm{I}_{\mathrm{a}}} \alpha_{\mathrm{i}} \varphi\left(\mathrm{x}_{\mathrm{i}}\right)-\sum_{\mathrm{i} \in \mathrm{I}_{\mathrm{b}}} \alpha_{\mathrm{i}} \varphi\left(\mathrm{x}_{\mathrm{i}}\right)$     (17)

$\frac{\partial \mathrm{L}}{\partial \mathrm{b}}=0 \rightarrow \sum_{\mathrm{i} \in \mathrm{I}_{\mathrm{a}}} \alpha_{\mathrm{i}}-\sum_{\mathrm{i} \in \mathrm{I}_{\mathrm{b}}} \alpha_{\mathrm{i}}=0$     (18)

$\frac{\partial \mathrm{L}}{\partial \xi}=0 \rightarrow \alpha_{\mathrm{i}}=\mathrm{C}_{2}-\gamma_{\mathrm{i}}, \quad \mathrm{i}=1,2, \ldots, \mathrm{n}$     (19)

$\mathrm{y}_{\mathrm{i}}-\mathrm{w}^{\mathrm{T}} \varphi\left(\mathrm{x}_{\mathrm{i}}\right)-\mathrm{b} \leq \varepsilon+\xi_{\mathrm{i}}, \quad \mathrm{i} \in \mathrm{I}_{\mathrm{a}}$     (20)

$\mathrm{w}^{\mathrm{T}} \varphi\left(\mathrm{x}_{\mathrm{i}}\right)+\mathrm{b}-\mathrm{y}_{\mathrm{i}} \leq \varepsilon+\xi_{\mathrm{i}}, \quad \mathrm{i} \in \mathrm{I}_{\mathrm{b}}$     (21)

$\alpha_{\mathrm{i}}\left(\varepsilon+\xi_{\mathrm{i}}-y_{\mathrm{i}}+\mathrm{w}^{\mathrm{T}} \varphi\left(\mathrm{x}_{\mathrm{i}}\right)+\mathrm{b}\right)=0, \quad \mathrm{i} \in \mathrm{I}_{\mathrm{a}}$     (22)

$\alpha_{i}\left(\varepsilon+\xi_{i}+y_{i}-w^{T} \varphi\left(x_{i}\right)-b\right)=0, \quad i \in I_{b}$     (23)

$\gamma_{i} \xi_{i}=0, \quad i=1,2, \ldots, n$     (24)

$\xi_{\mathrm{i}}, \alpha_{\mathrm{i}}, \gamma_{\mathrm{i}} \geq 0, \quad \mathrm{i}=1,2, \ldots, \mathrm{n}$     (25)

By substituting Eqs. (17) to (24) into the Lagrange function, it simplifies as follows:

$\mathrm{L}=\sum_{\mathrm{i} \in \mathrm{I}_{\mathrm{a}}} \alpha_{\mathrm{i}} \mathrm{y}_{\mathrm{i}}-\sum_{\mathrm{i} \in \mathrm{I}_{\mathrm{b}}} \alpha_{\mathrm{i}} \mathrm{y}_{\mathrm{i}}-\varepsilon \sum_{\mathrm{i}=1}^{\mathrm{n}} \alpha_{\mathrm{i}}$

$-\frac{1}{2}\left(\sum_{i \in I_{a}} \sum_{j \in I_{a}} \alpha_{i} \alpha_{j} k\left(x_{i}, x_{j}\right)+\sum_{i \in I_{b}} \sum_{j \in I_{b}} \alpha_{i} \alpha_{j} k\left(x_{i}, x_{j}\right)-2 \sum_{i \in I_{a}} \sum_{j \in I_{b}} \alpha_{i} \alpha_{j} k\left(x_{i}, x_{j}\right)\right)$     (26)

In addition, according to Eq. (19), and since αii≥0, we have:

$0 \leq \alpha_{\mathrm{i}} \leq \mathrm{C}_{2}, \quad \mathrm{i}=1,2, \ldots, \mathrm{n}$     (27)

Therefore, the dual of the model (15) is as follows:

$\max _{\alpha} \sum_{i \in I_{a}} \alpha_{i} y_{i}-\sum_{i \in I_{b}} \alpha_{i} y_{i}-\varepsilon \sum_{i=1}^{n} \alpha_{i}$

$-\frac{1}{2}\left(\sum_{i \in I_{a}} \sum_{j \in I_{a}} \alpha_{i} \alpha_{j} k\left(x_{i}, x_{j}\right)+\sum_{i \in I_{b}} \sum_{j \in I_{b}} \alpha_{i} \alpha_{j} k\left(x_{i}, x_{j}\right)\right.$

$\left.-2 \sum_{i \in I_{a}} \sum_{j \in I_{b}} \alpha_{i} \alpha_{j} k\left(x_{i}, x_{j}\right)\right)$

subject to $\left\{\begin{array}{l}\sum_{i \in I_{a}} \alpha_{i}-\sum_{i \in I_{b}} \alpha_{i}=0 \\ 0 \leq \alpha_{i} \leq C_{2}, \quad i=1,2, \ldots, n\end{array}\right.$     (28)

If we define:

$\beta_{i}=\left\{\begin{array}{ll}\alpha_{i}, & i \in I_{a} \\ -\alpha_{i}, & i \in I_{b}\end{array}\right.$     (29)

then, the model (28) can be written as follows:

$\max _{\beta} \sum_{i=1}^{n} \beta_{i} y_{i}-\varepsilon\left(\sum_{i \in I_{a}} \beta_{i}-\sum_{i \in I_{b}} \beta_{i}\right)$$-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_{i} \beta_{j} k\left(x_{i}, x_{j}\right)$

subject to $\left\{\begin{array}{l}\sum_{i=1}^{n} \beta_{i}=0 \\ 0 \leq \beta_{i} \leq C_{2}, \quad i \in I_{a} \\ -C_{2} \leq \beta_{i} \leq 0, \quad i \in I_{b}\end{array}\right.$     (30)

which is a convex quadratic programming problem. After solving the model (30) and obtaining the optimal value of $\beta$, the optimal values of $\alpha$ can be determined using Eq. (29). Given Eq. (22), for each $\mathrm{i} \in \mathrm{I}_{\mathrm{a}}$, if αi>0 then

$y_{i}-w^{T} \varphi\left(x_{i}\right)-b=\varepsilon+\xi_{i}$     (31)

and according to Eq. (19), if αi<C2 then

$\gamma_{i}>0$     (32)

Then, according to Eq. (24), we have:

$\xi_{i}=0$     (33)

Thus, for each $\mathrm{i} \in \mathrm{I}_{\mathrm{a}}$, if 0<αi<C2, the bias can be obtained from the following equation:

$\mathrm{b}=y_{\mathrm{i}}-\mathrm{w}^{\mathrm{T}} \varphi\left(\mathrm{x}_{\mathrm{i}}\right)-\varepsilon$     (34)

According to Eq. (23), for each $\mathrm{i} \in \mathrm{I}_{b}$, if αi>0, then

$-y_{i}+w^{T} \varphi\left(x_{i}\right)+b=\varepsilon+\xi_{i}$     (35)

and according to Eq. (19), if αi<C2, then

$\gamma_{i}>0$     (36)

Then, according to Eq. (24), we have:

$\xi_{\mathrm{i}}=0$     (37)

Thus, for each $\mathrm{i} \in \mathrm{I}_{\mathrm{b}}$, if 0<αi<C2, the bias can be obtained from the following equation:

$\mathrm{b}=\varepsilon+\mathrm{y}_{\mathrm{i}}-\mathrm{w}^{\mathrm{T}} \varphi\left(\mathrm{x}_{\mathrm{i}}\right)$      (38)

According to Eq. (17), the optimal hyperplane or estimated function is as follows:

$f(x)=w^{T} \varphi(x)+b$$=\mathrm{b}+\sum_{\mathrm{i} \in \mathrm{I}_{\mathrm{a}}} \alpha_{\mathrm{i}} \mathrm{k}\left(\mathrm{x}, \mathrm{x}_{\mathrm{i}}\right)-\sum_{\mathrm{i} \in \mathrm{I}_{\mathrm{b}}} \alpha_{\mathrm{i}} \mathrm{k}\left(\mathrm{x}, \mathrm{x}_{\mathrm{i}}\right)$$=\mathrm{b}+\sum_{i=1}^{\mathrm{n}} \beta_{\mathrm{i}} \mathrm{k}\left(\mathrm{x}, \mathrm{x}_{\mathrm{i}}\right)=\mathrm{b}+\sum_{\mathrm{i} \in \mathrm{SV}} \beta_{\mathrm{i}} \mathrm{k}\left(\mathrm{x}, \mathrm{x}_{\mathrm{i}}\right)$     (39)

where, $\mathrm{SV}=\left\{\mathrm{i} \mid \beta_{\mathrm{i}} \neq 0\right\}$ is the set of support vectors.
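Putting the pieces of this section together, the sketch below is one possible end-to-end implementation of the proposed method: an RLS warm start (Section 2.1), the split of the training data into I_a and I_b, the dual problem (30) solved as a QP, the bias from Eqs. (34) and (38), and the final estimate (39). It reuses gaussian_kernel(), rls_fit(), rls_predict() and _cvx() from the earlier sketches; the hyper-parameter values and the tolerance used to pick an unbounded support vector are illustrative assumptions.

```python
import numpy as np
from cvxopt import solvers

def proposed_fit(X, y, sigma=1.0, C1=10.0, C2=10.0, eps=0.1):
    n = X.shape[0]
    # Step 1: initial RLS estimate, used only to split the data into I_a / I_b.
    alpha0, b0 = rls_fit(X, y, sigma, C1)
    f0 = rls_predict(X, alpha0, b0, X, sigma)
    above = (y - f0) > 0                 # I_a: points above the initial estimate
    s = np.where(above, 1.0, -1.0)       # sign pattern of Eq. (29): beta_i = s_i * alpha_i

    # Step 2: dual (30) as a minimization of 1/2 beta^T K beta + (eps*s - y)^T beta.
    K = gaussian_kernel(X, X, sigma)
    P = K + 1e-8 * np.eye(n)             # small ridge keeps P numerically PSD
    q = eps * s - y
    lo = np.where(above, 0.0, -C2)       # 0 <= beta_i <= C2 for i in I_a
    hi = np.where(above, C2, 0.0)        # -C2 <= beta_i <= 0 for i in I_b
    G = np.vstack([-np.eye(n), np.eye(n)])
    h = np.concatenate([-lo, hi])
    A = np.ones((1, n))                  # equality constraint: sum_i beta_i = 0
    sol = solvers.qp(_cvx(P), _cvx(q), _cvx(G), _cvx(h), _cvx(A), _cvx([0.0]))
    beta = np.array(sol["x"]).ravel()

    # Step 3: bias from an unbounded support vector, Eq. (34) for I_a or Eq. (38) for I_b.
    tol = 1e-6
    i = np.where((np.abs(beta) > tol) & (np.abs(beta) < C2 - tol))[0][0]
    b = y[i] - K[i] @ beta - (eps if above[i] else -eps)
    return beta, b

def proposed_predict(X_train, beta, b, X_new, sigma=1.0):
    # Eq. (39): f(x) = b + sum_i beta_i k(x_i, x), with support vectors {i : beta_i != 0}.
    return gaussian_kernel(X_new, X_train, sigma) @ beta + b
```

Compared with the ε-SVR sketch of Section 2.2, the QP here has n variables instead of 2×n, which is exactly the source of the runtime gain discussed in the next section.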

4. Experiments

In this section, our proposed method is compared with ε-SVR [4], ε-TSVR [9] and pair v-SVR [11] on 15 benchmark data sets from the UCI repository. The characteristics of these data sets are given in Table 1. The kernel function used in each regression method is the Gaussian kernel. For each data set, the optimal values of the parameters C1, C2, C3, C4 were selected from the set {0, 0.01, …, 100}, the optimal value of the parameter ε1 = ε2 = ε was selected from the set {0, 0.1}, and the optimal value of the parameter σ was selected from the set {0.1, 0.2, …, 100} using the grid search mechanism. These optimal values are reported in Table 2. The RMSE and runtime of each method for the best values of its parameters are given in Table 3 and Table 4, respectively. To calculate the RMSE, 10-fold cross validation was used. As can be seen, the RMSE of our proposed method is competitive with the three other methods, and its runtime is much less than that of ε-SVR and less than that of ε-TSVR and pair v-SVR. The reason is that, in our proposed method, instead of solving the constrained quadratic programming problem (11) with 2×n variables, two smaller problems are solved: the unconstrained quadratic programming problem (2) with n + 1 variables and then the constrained quadratic programming problem (30) with n variables. Also, in each of ε-TSVR and pair v-SVR, two constrained quadratic programming problems with n variables are solved. Solving the mentioned unconstrained quadratic problem is faster than solving constrained quadratic problems of the same size. It should be noted that each problem was solved using MATLAB 2015 on a computer with 8 GB of RAM and a 2.20 GHz CPU. The quadprog function with the interior-point-convex algorithm was used to solve each constrained quadratic model.
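The following sketch outlines the tuning protocol described above (grid search with 10-fold cross-validated RMSE) for the proposed method, using scikit-learn's KFold for splitting. The grids shown are small illustrative subsets of the ranges used in the paper, and the helper names refer to the sketches of the previous sections; this is not the authors' evaluation script.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def cv_rmse(X, y, sigma, C1, C2, eps=0.1, n_splits=10, seed=0):
    # 10-fold cross-validated RMSE of the proposed method for one parameter setting.
    errs = []
    for tr, te in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        beta, b = proposed_fit(X[tr], y[tr], sigma, C1, C2, eps)
        pred = proposed_predict(X[tr], beta, b, X[te], sigma)
        errs.append(np.sqrt(np.mean((pred - y[te]) ** 2)))
    return float(np.mean(errs))

def grid_search(X, y, sigmas=(0.1, 1.0, 10.0, 100.0), Cs=(0.01, 1.0, 100.0), eps=0.1):
    # Exhaustive grid search over (sigma, C1, C2); returns the best setting found.
    best = None
    for sigma, C1, C2 in product(sigmas, Cs, Cs):
        rmse = cv_rmse(X, y, sigma, C1, C2, eps)
        if best is None or rmse < best[0]:
            best = (rmse, sigma, C1, C2)
    return best  # (RMSE, sigma, C1, C2)
```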

Table 1. Characteristics of 15 UCI datasets

No. | Dataset | Application | #Instances | #Features
1 | Pyrimidine | Regression | 74 | 27
2 | Triazines | Regression | 186 | 61
3 | Bodyfat | Regression | 252 | 14
4 | Haberman | Classification | 306 | 4
5 | Yacht Hydrodynamics | Regression | 308 | 7
6 | Ionosphere | Classification | 351 | 34
7 | Housing | Regression | 506 | 14
8 | Pima Indians Diabetes | Classification | 768 | 8
9 | Concrete Compressive Strength | Regression | 1030 | 9
10 | Mg | Regression | 1385 | 7
11 | Banknote Authentication | Classification | 1372 | 5
12 | Abalone | Regression | 4177 | 8
13 | Wine | Regression | 4898 | 12
14 | Wisconsin Breast Cancer | Classification | 569 | 32
15 | Forest Fires | Regression | 517 | 13

Note that determining appropriate values for the parameters of a model is a challenging issue; these values can be estimated using a validation set or based on prior knowledge, which may not be available for every dataset. Figures 2, 3 and 4 show the sensitivity of the RMSE of our proposed method to the parameters C1, C2, and σ, respectively. According to Figure 2, for each dataset except the “Concrete” dataset, the best RMSE is obtained with a large value of C1. Moreover, the sensitivity of the RMSE of our proposed method to the parameter C1 on the “Concrete” dataset is small. According to Figure 3, the best RMSE is obtained with a large value of C2 for some datasets and with a small value of C2 for the others. However, as can be seen, the RMSE is not too sensitive to C2 over a wide range of values. For example, the sensitivity of the RMSE of our proposed method to the parameter C2 on the “Triazines” dataset is about 0.004 over the range [0,100]. According to Figure 4, the RMSE is also not too sensitive to σ over a wide range of values. For example, the sensitivity of the RMSE of our proposed method to the parameter σ on the “Triazines” dataset is about 0.16 over its whole range, and about 0.01 over the range [20,100].

Table 2. The best parameters of our proposed method with the lowest RMSE

Dataset | σ | C1 | C2
Pyrimidine | 63.4 | 100 | 100
Triazines | 28.3 | 100 | 100
Bodyfat | 100 | 100 | 20
Haberman | 100 | 100 | 100
Yacht Hydrodynamics | 2 | 0.01 | 100
Ionosphere | 1.4 | 50 | 1
Housing | 50 | 0.01 | 100
Pima Indians Diabetes | 49.4 | 100 | 1
Concrete Compressive Strength | 38 | 0.01 | 100
Mg | 2 | 0.01 | 100
Banknote Authentication | 1.2 | 100 | 10
Abalone | 100 | 100 | 100
Wine | 1 | 100 | 100
Wisconsin Breast Cancer | 20.1 | 20 | 1
Forest Fires | 59.3 | 100 | 1

Table 3. The mean and standard deviation of RMSE of our proposed method, ε-SVR, ε-TSVR and pair v-SVR

Dataset | ε-SVR | ε-TSVR | pair v-SVR | Proposed
Pyrimidine | 0.0842 ± 0.0593 | 0.0656 ± 0.0471 | 0.0675 ± 0.0367 | 0.1180 ± 0.0615
Triazines | 0.1503 ± 0.0338 | 0.1362 ± 0.0434 | 0.1456 ± 0.0457 | 0.1512 ± 0.0332
Bodyfat | 4.7757 ± 0.8102 | 8.1441 ± 2.1848 | 7.6785 ± 1.5767 | 6.2215 ± 1.4876
Haberman | 0.8664 ± 0.1086 | 0.8775 ± 0.1395 | 0.8452 ± 0.1065 | 0.8615 ± 0.1055
Yacht Hydrodynamics | 10.9085 ± 1.2578 | 10.3360 ± 1.8207 | 10.5678 ± 1.5342 | 9.6190 ± 0.9907
Ionosphere | 0.5148 ± 0.1096 | 0.5022 ± 0.1283 | 0.4322 ± 0.1164 | 0.4354 ± 0.1279
Housing | 6.7392 ± 2.1573 | 8.9263 ± 3.5525 | 7.5678 ± 3.4234 | 8.5130 ± 3.3002
Pima Indians Diabetes | 0.9112 ± 0.0640 | 0.9871 ± 0.0033 | 0.9454 ± 0.0197 | 0.8425 ± 0.0946
Concrete Compressive Strength | 13.9188 ± 6.8642 | 14.6849 ± 7.1039 | 13.7545 ± 6.7433 | 14.6370 ± 5.2094
Mg | 0.1014 ± 0.0123 | 0.1113 ± 0.0156 | 0.0945 ± 0.0154 | 0.0999 ± 0.0134
Banknote Authentication | 0.0857 ± 0.0184 | 0.0768 ± 0.0183 | 0.0956 ± 0.0224 | 0.1178 ± 0.0209
Abalone | 2.0999 ± 0.6538 | 2.1265 ± 0.6173 | 2.0451 ± 0.9835 | 3.1384 ± 0.9881
Wine | 0.8637 ± 0.0660 | 0.8685 ± 0.0576 | 0.8621 ± 0.0452 | 0.8601 ± 0.0355
Wisconsin Breast Cancer | 1.8819 ± 0.5168 | 1.8847 ± 0.5224 | 1.8664 ± 0.4767 | 1.8703 ± 0.4350
Forest Fires | 0.1974 ± 0.2502 | 0.1288 ± 0.2810 | 0.1198 ± 0.2576 | 0.1250 ± 0.2746

Table 4. The runtime of our proposed method, ε- SVR, ε-TSVR and pair v-SVR (seconds)

Dataset | ε-SVR | ε-TSVR | pair v-SVR | Proposed
Pyrimidine | 0.41 | 0.29 | 0.35 | 0.12
Triazines | 3.44 | 2.12 | 2.87 | 1.71
Bodyfat | 5.69 | 0.71 | 1.75 | 0.28
Haberman | 8.34 | 0.78 | 2.76 | 0.43
Yacht Hydrodynamics | 9.79 | 0.64 | 3.65 | 0.26
Ionosphere | 9.12 | 4.20 | 6.43 | 3.08
Housing | 33.94 | 2.06 | 9.34 | 0.77
Pima Indians Diabetes | 40.85 | 13.45 | 21.67 | 12.34
Concrete Compressive Strength | 144.2 | 40.79 | 31.43 | 38.32
Mg | 187.02 | 10.26 | 49.45 | 4.84
Banknote Authentication | 252.38 | 11.82 | 51.43 | 11.70
Abalone | 14202.98 | 199.51 | 199.51 | 54.07
Wine | 3906.94 | 1384.58 | 2232.32 | 920.94
Wisconsin Breast Cancer | 2.87 | 1.92 | 2.02 | 1.53
Forest Fires | 17.81 | 3.70 | 5.21 | 2.51

Figure 2. The sensitivity of our proposed method to the parameter C1 for the best C2 and σ reported in Table 2

Figure 3. The sensitivity of our proposed method to the parameter C2 for the best C1 and σ reported in Table 2

Figure 4. The sensitivity of our proposed method to the parameter σ for the best C1 and C2 reported in Table 2

5. Conclusions

The RLS method is one of the fastest methods for function estimation because it is an n + 1 variable problem with a closed-form solution obtained by inverting an (n + 1) × (n + 1) matrix, but it is highly sensitive to noisy data. In contrast, ε-SVR has good resistance against noise, but its runtime is not as good as that of RLS, because solving the dual of ε-SVR requires solving a quadratic programming problem with 2×n variables and linear constraints. ε-SVR supposes that the noise in the output variable is at most ε; therefore, the center of a tube with radius ε, which is used as the estimated function, is determined in such a way that the training data are located in that tube, which makes the method robust to such noisy data. In this paper, to improve the runtime of ε-SVR, an initial estimated function was first obtained using the RLS method. Then, unlike ε-SVR, which uses all data to determine the lower and upper limits of the tube, our proposed method used the initial estimated function to determine the tube and the final estimated function. Strictly speaking, the data below and above the initial estimated function were used to estimate the upper and lower limits of the tube, respectively. Thus, the number of model constraints and, consequently, the model runtime were reduced.

The runtime of our proposed method is also lower than that of ε-TSVR and pair v-SVR, because each of ε-TSVR and pair v-SVR solves two quadratic programming problems with n variables and linear constraints, while our proposed method solves an unconstrained quadratic programming problem with n + 1 variables and a closed-form solution, and then a quadratic programming problem with n variables and linear constraints. Solving an unconstrained quadratic programming problem is faster than solving a constrained quadratic programming problem of the same size. Our experiments on 15 benchmark data sets confirm that our proposed method is faster than ε-SVR, ε-TSVR and pair v-SVR, and that its accuracy is comparable with that of ε-SVR, ε-TSVR and pair v-SVR.

Nomenclature

xi: i-th training data
yi: Output value of the i-th training data
$\xi_{\mathrm{i}}, \hat{\xi}_{\mathrm{i}}$: Slack variables for the i-th training data
n: Number of training data
w: Weight vector of the hyperplane
b: Bias of the hyperplane
$\alpha_{i}, \widehat{\alpha}_{i}$: Lagrange coefficients for the i-th training data
φ(.): A function which maps data from the input space into a high-dimensional feature space
k(.,.): Kernel function
σ: Hyper-parameter of the Gaussian kernel function
C1, C2, C3, C4: Hyper-parameters of the models controlling the penalty terms
ε: Radius of the tube

References

[1] Bretscher, O. (1997). Linear Algebra with Applications. USA: Prentice-Hall International Inc.

[2] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1): 267-288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

[3] Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2): 301-320. https://doi.org/10.1111/j.1467-9868.2005.00503.x

[4] Vapnik, V. (2013). The nature of statistical learning theory. Springer Science & Business Media.

[5] Schölkopf, B., Smola, A.J., Williamson, R.C., Bartlett, P.L. (2000). New support vector algorithms. Neural Computation, 12(5): 1207-1245. https://doi.org/10.1162/089976600300015565

[6] Jeng, J.T., Chuang, C.C., Su, S.F. (2003). Support vector interval regression networks for interval regression analysis. Fuzzy Sets and Systems, 138(2): 283-300. https://doi.org/10.1016/S0165-0114(02)00570-5

[7] Hao, P.Y. (2009). Interval regression analysis using support vector networks. Fuzzy Sets and Systems, 160(17): 2466-2485. https://doi.org/10.1016/j.fss.2008.10.012

[8] Peng, X. (2010). TSVR: An efficient twin support vector machine for regression. Neural Networks, 23(3): 365-372. https://doi.org/10.1016/j.neunet.2009.07.002

[9] Shao, Y.H., Zhang, C.H., Yang, Z.M., Jing, L., Deng, N.Y. (2013). An ε-twin support vector machine for regression. Neural Computing and Applications, 23(1): 175-185. https://doi.org/10.1007/s00521-012-0924-3

[10] Peng, X., Xu, D., Kong, L.Y., Chen, D.J. (2016). L1-norm loss based twin support vector machine for data recognition. Information Sciences, 340-341: 86-103. https://doi.org/10.1016/j.ins.2016.01.023

[11] Hao, P.Y. (2017). Pair v-SVR: A novel and efficient pairing nu-support vector regression algorithm. IEEE Transactions on Neural Networks and Learning Systems, 28(11): 2503-2515. https://doi.org/10.1109/TNNLS.2016.2598182

[12] Huang, X., Shi, L., Pelckmans, K., Suykens, J.A.K. (2014). Asymmetric ν-tube support vector regression. Computational Statistics & Data Analysis, 77: 371-382. https://doi.org/10.1016/j.csda.2014.03.016

[13] Xu, Y., Li, X.Y., Pan, X.L., Yang, Z.J. (2017). Asymmetric ν-twin support vector regression. Neural Computing and Applications, 30: 1-16. https://doi.org/10.1007/s00521-017-2966-z

[14] Kuhn, H.W., Tucker, A.W. (2014). Nonlinear programming. In: Traces and Emergence of Nonlinear Programming. Springer, pp. 247-258.