A Feature Selection Methodology for Steganalysis
Steganography has long been known and used as a way to exchange information unnoticeably between parties, by embedding it in another, apparently innocuous, document. Nowadays, steganographic techniques are mostly applied to digital content. The online newspaper Wired News reported in one of its articles on steganography that steganographic content had been found on web sites hosting very large image databases, such as eBay. Niels Provos has somewhat refuted these claims: after analyzing and classifying two million images from eBay and one million from the USENET network, he found no steganographic content embedded in them. This could be due to many reasons, such as very low payloads, which make steganographic images very robust and secure against steganalysis.
The security of a steganographic scheme has been defined theoretically by Cachin, but this definition is seldom usable in practice: it requires estimating the distributions of the cover and stego media and measuring the Kullback-Leibler divergence between them.
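To make the definition concrete: in Cachin's model, a scheme is called ε-secure when the Kullback-Leibler divergence between the cover distribution and the stego distribution is at most ε. The sketch below computes that divergence for two small discrete distributions. The histograms are hypothetical toy values, not measurements from real images, and the function name `kl_divergence` is chosen here for illustration only.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), in bits, between two discrete
    distributions given as equal-length sequences of probabilities."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue            # a term with p = 0 contributes nothing
        if qi == 0.0:
            return math.inf     # p puts mass where q has none
        total += pi * math.log2(pi / qi)
    return total

# Hypothetical toy histograms over four symbol values (assumption, not real data).
cover = [0.40, 0.30, 0.20, 0.10]
stego = [0.38, 0.31, 0.20, 0.11]
epsilon = kl_divergence(cover, stego)
print(f"D(cover || stego) = {epsilon:.5f} bits")
```

The practical difficulty mentioned above is visible even here: the computation is trivial once the two distributions are known, but for real images they are high-dimensional and can only be estimated.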
In practice, steganalysis is used to evaluate the security of a steganographic scheme empirically: it aims at detecting whether a medium has been tampered with, but not at recovering what is hidden in the medium or how it was embedded. Features extracted from the medium capture some of its relevant characteristics, and machine learning tools are then typically used to decide whether the medium is genuine or not. This is only one way to perform steganalysis, but it remains the most common.
One of the main issues with this approach is that more and more features tend to be extracted from the media (only JPEG images are considered in this article) in order to improve the detection of modified images. The number of features is the dimensionality of the space in which the machine learning processes (typically, the training of a classifier) are performed. This usually leads to very high-dimensional spaces, in which problems arise that low-dimensional spaces do not exhibit. Chiefly, the number of images required to fill the space in which the classifier is trained is never reached, although such filling is needed for the classifier to train on data that are properly distributed over the feature space. In addition, when the number of features is too high, interpreting the most relevant features becomes very difficult, if not impossible.
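A rough way to see why the space is never filled is to cut every feature axis into a few bins and count the resulting cells. The sketch below does this with hypothetical numbers (10,000 images, 2 bins per feature); these values and the function names are assumptions for illustration and do not come from the article.

```python
def grid_cells(bins_per_feature, n_features):
    """Number of cells when every feature axis is split into equal bins."""
    return bins_per_feature ** n_features

def max_filled_fraction(n_images, bins_per_feature, n_features):
    """Upper bound on the fraction of cells that n_images can occupy,
    since each image falls into exactly one cell."""
    return min(1.0, n_images / grid_cells(bins_per_feature, n_features))

# Hypothetical setting: 10,000 images and only 2 bins per feature.
for d in (5, 10, 20, 40):
    print(d, grid_cells(2, d), max_filled_fraction(10_000, 2, d))
```

Even with this very coarse grid, the number of cells doubles with every added feature, so a fixed image set covers a rapidly vanishing share of the space.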
In this article, some of the problems caused by the high dimensionality usually encountered in steganalysis are presented, along with possible solutions.
To address the number of images required to fill the space, an estimate of a sufficient number of images is proposed: a bootstrap algorithm is used to estimate the variance of the classifier's results for different numbers of images. Once the variance is low enough for the results to be accurate, the number of images required for that number of features has been reached.
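The idea can be sketched as follows, under several assumptions: the data are synthetic Gaussian features rather than image features, the classifier is a simple nearest-centroid rule standing in for whatever classifier is actually used, and the training-set sizes, the number of bootstrap rounds and the threshold are arbitrary. For each training-set size, training sets are drawn with replacement from a pool, the classifier is trained and scored on a fixed test set, and the standard deviation of the scores is recorded.

```python
import random
import statistics

def make_sample(rng, label, n_features=5):
    # Hypothetical synthetic features: class 1 is shifted by 0.5 on every axis.
    return [rng.gauss(0.5 * label, 1.0) for _ in range(n_features)], label

def train_centroids(data):
    """Nearest-centroid classifier: one mean vector per class present in data."""
    sums, counts = {}, {}
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))

def accuracy(centroids, data):
    return sum(predict(centroids, x) == y for x, y in data) / len(data)

def bootstrap_std(pool, test, n_train, n_rounds, rng):
    """Std. dev. of test accuracy over bootstrap training sets of size n_train."""
    scores = []
    for _ in range(n_rounds):
        resample = [rng.choice(pool) for _ in range(n_train)]  # with replacement
        scores.append(accuracy(train_centroids(resample), test))
    return statistics.stdev(scores)

rng = random.Random(0)
pool = [make_sample(rng, i % 2) for i in range(2000)]
test = [make_sample(rng, i % 2) for i in range(500)]
stds = {n: bootstrap_std(pool, test, n, 30, rng) for n in (10, 50, 200, 1000)}

threshold = 0.02  # hypothetical tolerance on the standard deviation
sufficient = next((n for n in sorted(stds) if stds[n] <= threshold), None)
for n in sorted(stds):
    print(n, round(stds[n], 4))
print("smallest size under threshold:", sufficient)
```

The smallest training-set size whose standard deviation falls below the chosen tolerance plays the role of the "sufficient number of images" for that feature set.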
With this sufficient number of images, feature selection is then performed with a forward algorithm, in an attempt to decrease the dimensionality and to gain interpretability regarding which features react the most. Knowledge about the steganographic scheme can hence be inferred, and the scheme could be modified accordingly to improve its security.
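A forward algorithm of this kind can be sketched as a greedy loop: start from an empty set, add the single feature that improves a chosen score the most, and stop when no remaining feature helps. The code below is a minimal illustration under stated assumptions, not the article's implementation: the score is the validation accuracy of a nearest-centroid classifier, the data are synthetic, and only features 0 and 3 are made informative (the constant `INFORMATIVE` is hypothetical).

```python
import random

def forward_selection(n_features, score):
    """Greedy forward selection.

    `score` maps a list of feature indices to a number to maximise.
    Features are added one at a time; the search stops as soon as no
    remaining feature improves the best score found so far."""
    selected, history, best = [], [], float("-inf")
    remaining = set(range(n_features))
    while remaining:
        cand_score, cand = max((score(selected + [f]), f) for f in sorted(remaining))
        if cand_score <= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
        history.append(best)
    return selected, history

INFORMATIVE = (0, 3)  # hypothetical: only these features react to embedding

def make_sample(rng, label, n_features=6):
    x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    for j in INFORMATIVE:
        x[j] += 3.0 * label
    return x, label

def centroid_accuracy(train, valid, features):
    """Validation accuracy of a nearest-centroid classifier on `features`."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    hits = 0
    for x, y in valid:
        d = {c: sum((x[j] - m) ** 2 for j, m in zip(features, mu))
             for c, mu in centroids.items()}
        hits += min(d, key=d.get) == y
    return hits / len(valid)

rng = random.Random(1)
train = [make_sample(rng, i % 2) for i in range(300)]
valid = [make_sample(rng, i % 2) for i in range(300)]
selected, history = forward_selection(
    6, lambda feats: centroid_accuracy(train, valid, feats))
print("selected features:", selected)
print("validation accuracy after each addition:", history)
```

The order in which features enter the set is what gives the interpretability discussed above: the features picked first are the ones that react most strongly to the embedding.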
These ideas are combined in a methodology, which is tested on six different steganographic algorithms for different sizes of embedded information. The result is an estimate of the number of images sufficient for obtaining results with low enough variance. The selected sets of features also make it possible to keep the same performance (within the small variance range) while providing insights into the weaknesses of each algorithm. These weaknesses are analyzed separately for each algorithm.
In conclusion, the proposed methodology makes it possible to estimate the variance of the results typically reported in steganalysis, while adding interpretability. The proposed reduced feature sets also maintain the same performance as the full set.
The principle of steganalysis is to classify a suspect document as original or as steganographically modified. This article proposes a methodology for steganalysis using feature selection, aimed at narrowing the confidence intervals of the results usually reported. Feature selection also makes it possible to consider an interpretation of the selected image features, with the goal of understanding the inner workings of steganographic algorithms. It is shown that the standard deviation of the results usually obtained in classification can be very large (up to 5%) when training sets containing too few samples are used. These tests are conducted on six steganographic algorithms, used with four different embedding rates: 5, 10, 15 and 20%. Moreover, the selected features (generally 10 to 13 times fewer than in the full set) do bring out the weaknesses as well as the strengths of the algorithms used.
Steganography, steganalysis, feature selection, methodology
 R. BELLMAN, Adaptive control processes: a guided tour. Princeton University Press, 1961.
C. CACHIN, An information-theoretic model for steganography. In Information Hiding: 2nd International Workshop, volume 1525 of Lecture Notes in Computer Science, pages 306-318, 14-17 April 1998.
JPEG Committee, http://www.jpeg.com.
 D. FRANÇOIS, High-dimensional data analysis: optimal metrics and feature selection. PhD thesis, Université catholique de Louvain, September 2006.
J. FRIDRICH, Feature-based steganalysis for JPEG images and its implications for future design of steganographic schemes. In Information Hiding: 6th International Workshop, volume 3200 of Lecture Notes in Computer Science, pages 67-81, May 23-25 2004.
 S. HETZL and P. MUTZEL, A graph-theoretic approach to steganography. In Dittmann J., Katzenbeisser S., and Uhl A., editors, CMS 2005, Lecture Notes in Computer Science 3677, pages 119-128. Springer-Verlag, 2005.
 G.-B. HUANG, Q.-Y. ZHU, and C.-K. SIEW, Extreme learning machine: Theory and applications. Neurocomputing, 70(1-3):489-501, December 2006.
B. EFRON and R.J. TIBSHIRANI, An Introduction to the Bootstrap. Chapman & Hall, London, 1994.
 Y. KIM, Z. DURIC, and D. RICHARDS, Modified matrix encoding technique for minimal distortion steganography. In Information Hiding 2007, volume 4437/2007, pages 314-327, 2007.
A. LATHAM, Jphide&seek, August 1999. http://linux01.gwdg.de/alatham/stego.html.
S. LYU and H. FARID, Detecting hidden messages using higher-order statistics and support vector machines. In 5th International Workshop on Information Hiding, Noordwijkerhout, The Netherlands, 2002.
D. MCCULLAGH, Secret Messages Come in .Wavs. Online Newspaper: Wired News, February 2001. http://www.wired.com/news/politics/0,1283,41861,00.html.
 Y. MICHE, P. BAS, C. JUTTEN, O. SIMULA, and A. LENDASSE, A methodology for building regression models using extreme learning machine: OP-ELM. In ESANN 2008, European Symposium on Artificial Neural Networks, Bruges, Belgium, April 23-25 2008. to be published.
 Y. MICHE, P. BAS, A. LENDASSE, C. JUTTEN, and O. SIMULA, Extracting relevant features of steganographic schemes by feature selection techniques. In Wacha'07: Third Wavilla Challenge, June 14 2007.
 Y. MICHE, B. ROUE, P. BAS, and A. LENDASSE, A feature selection methodology for steganalysis. In MRCS06, International Workshop on Multimedia Content Representation, Classification and Security, Istanbul (Turkey), Lecture Notes in Computer Science. Springer-Verlag, September 11-13 2006.
T. PEVNY and J. FRIDRICH, Merging Markov and DCT features for multi-class JPEG steganalysis. In IS&T/SPIE 19th Annual Symposium Electronic Imaging Science and Technology, volume 6505 of Lecture Notes in Computer Science, January 29 - February 1 2007.
 N. PROVOS, Defending against statistical steganalysis. In 10th USENIX Security Symposium, pages 323-335, 13-17 April 2001.
 N. PROVOS and P. HONEYMAN, Detecting steganographic content on the internet. In Network and Distributed System Security Symposium. The Internet Society, 2002.
 F. ROSSI, A. LENDASSE, D. FRANÇOIS, V. WERTZ, and M. VERLEYSEN, Mutual information for the selection of relevant variables in spectrometric nonlinear modelling. Chemometrics and Intelligent Laboratory Systems, 80:215-226, 2006.
 P. SALLEE, Model-based steganography. In Digital Watermarking, volume 2939/2004 of Lecture Notes in Computer Science, pages 154-167. Springer Berlin/Heidelberg, 2004.
Y.Q. SHI, C. CHEN, and W. CHEN, A Markov process based approach to effective attacking JPEG steganography. In ICME'06: International Conference on Multimedia and Expo, Lecture Notes in Computer Science, 9-12 July 2006.
A. SORJAMAA, Y. MICHE, and A. LENDASSE, Long-term prediction of time series using NNE-based projection and OP-ELM. In IJCNN2008: International Joint Conference on Neural Networks, June 2008. to be published.
M. VERLEYSEN and D. FRANÇOIS, The curse of dimensionality in data mining and time series prediction. In IWANN'05: 8th International Work-Conference on Artificial Neural Networks, volume 3512 of Lecture Notes in Computer Science, pages 758-770, June 8-10 2005.
A. WESTFELD, F5-a steganographic algorithm. In Information Hiding: 4th International Workshop, volume 2137, pages 289-302, 25-27 April 2001.
 A. WESTFELD and A. PFITZMANN, Attacks on steganographic systems. In IH '99: Proceedings of the Third International Workshop on Information Hiding, pages 61-76, London, UK, 2000. Springer-Verlag.