OPEN ACCESS
As social media has grown into the most visited part of the internet, detecting fake accounts has become one of the hardest social media security challenges. Over the years, online social networks (OSNs) have evolved widely, moving part of our personal lives into virtual ones. This evolution also has negative effects. In 2012, 16.6 million Americans were victims of identity theft according to an estimate from the U.S. Bureau of Justice Statistics, with financial losses of up to $24.7 billion for these victims. Various techniques are used to manipulate users in OSN environments, such as social spam, identity theft, spear phishing and Sybil attacks. In this article, we analyze the behavior of multiple fake accounts that try to bypass OSN regulation. Within the context of social media manipulation detection, we focus on the special case of multiple-identity accounts (sockpuppets) created on English Wikipedia (EnWiki). We set up a complete methodology, from extracting the data from EnWiki to training and testing several machine learning algorithms on the selected data. As part of this methodology, we propose a set of features that builds on previous literature and can be used in automatic data analysis to detect sockpuppet accounts created on EnWiki. We apply these features to a database of 10,000 user accounts. The results compare several machine learning algorithms and show that our new features and training data make it possible to detect 99% of fake accounts, improving on previous results from the literature.
sockpuppet, machine learning application, manipulation, deception, identity, Wikipedia, collaborative project, social media