Semi-Automatic formalization of a patient/doctor vocabulary for breast cancer


Mike Donald Tapi Nzali, Jérôme Azé, Sandra Bringay, Christian Lavergne, Caroline Mollevi, Thomas Opitz

Université de Montpellier, France

Université Paul Valéry, Montpellier 3, France

Institut Montpelliérain Alexander Grothendieck, France

Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier, France

Institut du Cancer Montpellier, Montpellier, France

Biostatistique et Processus Spatiaux, INRA Avignon, France

31 October 2016



Social media is increasingly used by patients and health professionals. Patients are usually laypeople in the medical field: they use slang, abbreviations, and their own vocabulary in their exchanges. Automatically analyzing texts from social networks therefore requires a dedicated vocabulary. Working from a corpus of messages collected from social media such as forums and Facebook, we describe the construction of a lexical resource that aligns the vocabulary of patients with that of health professionals. To build this resource and transform it into a SKOS ontology, we combine several methods proposed in the literature that take linguistic and statistical aspects into account. This work will improve information retrieval in health forums, and it will facilitate statistical studies based on information extracted from these forums.


information extraction, social media, statistic-based measure, ontology, patient vocabulary.

1. Introduction
2. Motivations and state of the art
3. Methods
4. Results
5. Formalizing the resource as a SKOS ontology
6. Discussion
7. Conclusion and perspectives

This work was funded by the ANR SFIR (Semantic Indexing of French Biomedical Data Resources) and by the Institut de Recherche en Santé Publique (http:/ /

