Applying Hybrid Feature Selection Methods for Statistical Modelling of Roadside Particle Concentrations (PM2.5 and PNC)
Selecting which predictor variables to include in a statistical model is a demanding task. A model built with fewer predictor variables can be more interpretable and less expensive to run than one built with many input variables. This study investigates the effect of two hybrid feature selection methods, genetic algorithms (GA) and simulated annealing (SA), each combined with random forests (RF), on the efficiency of five variants of multiple linear regression models for predicting roadside PM2.5 and particle number count (PNC) concentrations. The GA-RF and SA-RF selected 9 and 16 variables, respectively, of the 27 predictor variables in the PM2.5 training data. Of the 25 candidate variables in the PNC training data, the GA-RF and the SA-RF each selected 13. The two methods selected nearly the same variables, especially for predicting PNC, while for the PM2.5 models the SA-RF selected 16 variables and the GA-RF selected only 10 variables. The hybrid feature selection methods eliminated most of the correlated variables, especially the background pollutants and the traffic variables, whereas the temporal variables and the meteorological variables were selected in all the cases considered. The statistical performance of the linear models built with the selected variables is similar to that of the models developed using the full set of predictor variables. The main benefit demonstrated by this study is the reduction in the number of predictor variables by more than half in most of the cases considered. This reduction should lower the operational and computational cost of the models, possibly without compromising their predictive performance, and should also enhance interpretability.
air quality, genetic algorithms (GA), particulate matter, random forests (RF), simulated annealing (SA), statistical modelling
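The wrapper idea behind a method such as SA-RF can be illustrated with a minimal sketch: a simulated-annealing search proposes feature subsets and keeps or rejects each one according to an error score. This is not the study's implementation. The function names, the cooling schedule, and in particular `toy_score` are assumptions made for illustration; `toy_score` is a hypothetical stand-in for the cross-validated random forest error that a real SA-RF run would compute on the training data.

```python
import math
import random


def anneal_feature_subset(n_features, score, n_iter=2000, t0=1.0,
                          cooling=0.995, seed=0):
    """Simulated-annealing search over feature subsets (lower score is better).

    `score` takes a list of booleans (True = feature included) and returns
    an error estimate. In a hybrid SA-RF setting this would be a model-based
    error; here it is left as a user-supplied callable.
    """
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_score = score(current)
    best, best_score = current[:], current_score
    temp = t0
    for _ in range(n_iter):
        # Propose a neighbour by flipping one randomly chosen feature.
        candidate = current[:]
        j = rng.randrange(n_features)
        candidate[j] = not candidate[j]
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept a worse subset with
        # probability exp(-delta / temp), which shrinks as temp cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_score = candidate, cand_score
            if current_score < best_score:
                best, best_score = current[:], current_score
        temp *= cooling
    return best, best_score


# Hypothetical toy problem: features 0, 3 and 5 are informative.
INFORMATIVE = {0, 3, 5}


def toy_score(mask):
    """Stand-in error: 1 per missing informative feature, 0.1 per feature kept."""
    missing = sum(1 for i in INFORMATIVE if not mask[i])
    return missing + 0.1 * sum(mask)


best, best_score = anneal_feature_subset(8, toy_score, seed=1)
selected = [i for i, keep in enumerate(best) if keep]
print("selected feature indices:", selected)
```

The size penalty in `toy_score` plays the role that out-of-sample error plays in a real wrapper: redundant variables add cost without reducing error, so the search tends to drop them. A genetic-algorithm variant would use the same scoring function but evolve a population of subsets through crossover and mutation instead of perturbing a single subset.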