A Survey on web data linking


Manel Achichi Zohra Bellahsene Konstantin Todorov

LIRMM / University of Montpellier, France

31 December 2016

Data are continuously published on the web in a decentralized manner, leading to a web of heterogeneous data. Given the large amount of published data, access to relevant information becomes difficult, hence the need to interconnect these data. In this paper, we propose a survey of approaches and tools addressing the data linking problem. The particularity of this survey is that we consider the linking process as a pipeline composed of pre-processing, main matching and post-processing phases, and we review the different techniques applied at each of these three steps in service of the global linking task. The actual task of linking two data instances is certainly at the core of this process; however, what happens before and after this task is performed is of crucial importance for the effectiveness and the efficiency of a data linking tool. One of the important contributions of this paper lies in the organization of the approaches and tools in a (pseudo-)taxonomy with respect to the three major steps of the matching process (pre-processing, data matching and post-processing), splitting them further into several categories according to the tasks that each approach addresses and, finally, according to the techniques that are applied. We additionally consider a fourth, multi-step category of methods: those that act on more than one step of the matching process (they can be found on multiple leaves of our taxonomy). Finally, we describe and compare different state-of-the-art approaches and tools according to a set of criteria.


web of data, data linking, instance matching

1. Introduction
2. The Data Linking Problem
3. Review of Single-step Methods
4. Multi-Step Methods
5. Discussion and Comparison of the Tools and Approaches
6. Conclusion and Future Work
