About DACE-DL
DAta-CEntric AI-driven Data Linking (DACE-DL)
In a world of open science and FAIR data, Linked Data have been gaining popularity thanks to the means they offer for (meta)data exchange, federation and sharing on the Web. Data linking is the scientific challenge of automatically establishing typed links between the entities of two or more structured datasets. A variety of complex data linking systems exist and have been evaluated on public benchmarks. While they have allowed for the generation of vast amounts of linked data, generic solutions often have limited applicability in real-world scenarios where data are highly heterogeneous and very domain-specific. DACE-DL targets a paradigm shift with a data-centric, bottom-up methodology relying on machine learning models. Drawing on current and previous linked data projects (in music, encyclopedic knowledge, agronomy and biodiversity), we will study what AI can learn from already available linked data and from modularized existing data linking systems, in order to re-inject this knowledge into future Web-data challenges.
Objectives
DACE-DL’s main objective is to lay the methodological groundwork for a new vision of data-centric, bottom-up data linking. In this vision, ML and RL techniques automatically identify the linking problem types (LPTs) that arise between two knowledge graphs, and the atomic solutions (or combination of atomic solutions) best suited to the knowledge graphs at hand are then applied. We focus on the following specific objectives:
- Obj. 1: To provide approaches for learning and generating joint dataset profile features, which will enable the training and validation of ML models for data linking, building on the wealth of existing linked data and modularized data linking tools.
- Obj. 2: To provide proof-of-concept solutions in the following scenarios: (i) academic benchmark data (the OAEI campaign); (ii) encyclopedic data (DBpedia and YAGO); (iii) specific application domains from the results of our past and ongoing projects: AgroLD (plant biology) [1], ANR D2KAB (2019-2023) (agronomy & biodiversity) and ANR DOREMUS (2014-2018) (cultural heritage), as well as recently published datasets on Covid-19 [2]. We will hence reaffirm the legacy of past ANR-funded data linking projects, re-validating their results by applying the novel methods developed in Obj. 1.
- Obj. 3: To make available to the community (i) the analyses of the linking problem types (LPTs), (ii) the modularization of data linking tools and (iii) the correspondences between (i) and (ii), validated on the proof-of-concept solutions, thus facilitating the future development of explainable AI-based data linking tools.
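The two-step idea behind these objectives (profile a pair of datasets, identify the linking problem type, then dispatch to a matching atomic solution) can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in and not part of DACE-DL: the `Dataset` class, the single `property_overlap` feature, the two problem-type labels, the hand-written rule used in place of a trained model, and the two string-matching functions used in place of modularized linking tools.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Toy stand-in for a knowledge graph (hypothetical, for illustration)."""
    name: str
    entities: dict  # entity id -> {property name: literal value}


def properties(ds):
    """Set of property names used anywhere in the dataset."""
    return {p for desc in ds.entities.values() for p in desc}


def joint_profile(a, b):
    """One illustrative joint feature: Jaccard overlap of property names."""
    pa, pb = properties(a), properties(b)
    union = pa | pb
    return {"property_overlap": len(pa & pb) / len(union) if union else 0.0}


def identify_lpt(profile):
    """Hand-written rule standing in for a learned model (assumption)."""
    if profile["property_overlap"] < 0.5:
        return "schema_heterogeneity"
    return "value_heterogeneity"


def normalise(value):
    """Lower-case and collapse whitespace."""
    return " ".join(str(value).lower().split())


def match_on_shared_properties(a, b):
    """Atomic solution 1: equal normalised values on a shared property."""
    shared = properties(a) & properties(b)
    return {
        (ia, "owl:sameAs", ib)
        for ia, da in a.entities.items()
        for ib, db in b.entities.items()
        if any(p in da and p in db and normalise(da[p]) == normalise(db[p])
               for p in shared)
    }


def match_on_any_value(a, b):
    """Atomic solution 2: any normalised value in common, ignoring properties."""
    return {
        (ia, "owl:sameAs", ib)
        for ia, da in a.entities.items()
        for ib, db in b.entities.items()
        if {normalise(v) for v in da.values()} & {normalise(v) for v in db.values()}
    }


# Hypothetical mapping from problem type to atomic solution.
ATOMIC_SOLUTIONS = {
    "value_heterogeneity": match_on_shared_properties,
    "schema_heterogeneity": match_on_any_value,
}


def link(a, b):
    """Profile the pair, identify the problem type, apply the matching solution."""
    lpt = identify_lpt(joint_profile(a, b))
    return lpt, ATOMIC_SOLUTIONS[lpt](a, b)


ds_a = Dataset("A", {"a1": {"title": "Clair de Lune"}, "a2": {"title": "Gymnopédie No. 1"}})
ds_b = Dataset("B", {"b1": {"label": "clair  de lune"}, "b2": {"label": "Boléro"}})
ds_c = Dataset("C", {"c1": {"title": "CLAIR DE LUNE"}})

print(link(ds_a, ds_b))  # no shared property names, so values are compared across properties
print(link(ds_a, ds_c))  # shared "title" property, so values are compared property by property
```

In a realistic setting the rule in `identify_lpt` would be replaced by a model trained on profile features of already linked dataset pairs, and the dispatch table would point to modules extracted from existing data linking systems.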
[1] Venkatesan, A., (...), Jonquet, C. et al. Agronomic Linked Data (AgroLD): A knowledge-based system to enable integrative biology in agronomy. PLoS ONE, 2018.
[2] Michel, F., (...), Winckler, M. Covid-on-the-Web: Knowledge Graph and Services to Advance COVID-19 Research. ISWC, 2020.
[3] Todorov, K. (2019). Datasets First! A Bottom-up Data Linking Paradigm. In ISWC (Satellites) (pp. 338-342).
[4] Ben Ellefi, M., Bellahsene, Z., Breslin, J. G., Demidova, E., Dietze, S., Szymański, J., & Todorov, K. (2018). RDF dataset profiling – a survey of features, methods, vocabularies and applications. Semantic Web, 9(5), 677-705.
[5] Achichi, M., Bellahsene, Z., Ellefi, M. B., & Todorov, K. (2019). Linking and disambiguating entities across heterogeneous RDF graphs. Journal of Web Semantics, 55, 108-121.