Detecting erroneous identity links using community detection in identity graphs

Joe Raad, Wouter Beek, Nathalie Pernelle, Fatiha Saïs, Frank van Harmelen

UMR MIA-Paris, INRA, AgroParisTech, Université Paris-Saclay, Paris, France

LRI, CNRS UMR8623, Paris Sud University, Paris Saclay University, Orsay, France

Dept. of Computer Science, VU University Amsterdam, Amsterdam, The Netherlands

Corresponding Author Email: 
joe.raad@agroparistech.fr; {nathalie.pernelle,fatiha.sais}@lri.fr; {w.g.j.beek,frank.van.harmelen}@vu.nl
28 August 2018

Several studies have observed that the Semantic Web identity predicate owl:sameAs is sometimes used incorrectly. In this paper, we show how network metrics, such as the community structure of the owl:sameAs graph, can be used to detect such possibly erroneous statements. One benefit of this approach is that it requires only the network of owl:sameAs links and does not rely on any additional knowledge. We evaluate our approach on 558M owl:sameAs statements scraped from the LOD cloud. The evaluation shows that the approach scales and is efficient at detecting erroneous identity links.


Web of data, identity, owl:sameAs, communities
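
To make the idea in the abstract concrete, the following is a minimal sketch of how community structure alone might point to suspicious identity links. It is not the authors' implementation: the abstract does not say which community detection algorithm or scoring function is used, so this sketch substitutes a naive greedy modularity merge and simply flags every link that crosses a community boundary. The function names and the node identifiers (`a1`, `b1`, ...) are hypothetical placeholders.

```python
def detect_communities(edges):
    """Group nodes by naive greedy modularity merging (undirected, unweighted)."""
    # Assumption: each owl:sameAs statement is an undirected edge; self-links are dropped.
    edges = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Start with every node in its own community, labelled by the node itself.
    community = {n: n for n in degree}

    while True:
        # Total degree per community and number of links between community pairs.
        total_degree = {}
        for n, c in community.items():
            total_degree[c] = total_degree.get(c, 0) + degree[n]
        between = {}
        for u, v in edges:
            cu, cv = community[u], community[v]
            if cu != cv:
                key = (cu, cv) if cu < cv else (cv, cu)
                between[key] = between.get(key, 0) + 1
        # Modularity gain of merging i and j, scaled by 2*m^2 to stay in integers:
        # gain = 2*m*e_ij - d_i*d_j. Ties go to the first pair in sorted order.
        best_gain, best_pair = 0, None
        for (ci, cj), e_ij in sorted(between.items()):
            gain = 2 * m * e_ij - total_degree[ci] * total_degree[cj]
            if gain > best_gain:
                best_gain, best_pair = gain, (ci, cj)
        if best_pair is None:
            return community  # no merge improves modularity any further
        keep, drop = best_pair
        for n in community:
            if community[n] == drop:
                community[n] = keep


def suspicious_links(edges):
    """Return the links whose endpoints fall into different communities."""
    community = detect_communities(edges)
    crossing = {
        tuple(sorted(e))
        for e in edges
        if e[0] != e[1] and community[e[0]] != community[e[1]]
    }
    return sorted(crossing)


# Toy identity graph: two densely linked groups joined by a single link.
links = [
    ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
    ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
    ("a3", "b1"),
]
print(suspicious_links(links))  # [('a3', 'b1')]
```

The sketch uses no information beyond the links themselves, which mirrors the benefit claimed in the abstract. The naive merge loop recomputes all statistics on every pass, so it is only suitable for toy inputs; a graph with hundreds of millions of statements would need a scalable algorithm.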

1. Introduction
2. Related Work
3. Approach for detecting erroneous identity links
4. Experiments
5. Conclusion
