The main objective of Datalift is to bootstrap the semantic lifting of raw data on the Web. Several sub-objectives have been identified to reach it. First, ontologies must be available that allow providers to describe their data: we will investigate and develop methods and tools that help data providers select ontologies relevant to the description of their data. Second, the raw data must be converted into RDF that conforms to the selected ontology: Datalift will investigate and develop an integrated suite of tools that facilitates conversion from a large variety of source data formats (relational, XML, spreadsheets, microformats, and various metadata formats). Third, because the power of linked data comes from the quantity and quality of links between resources, Datalift will conduct research aimed at automating the interlinking process. Fourth, Datalift will research novel methods for attaching licenses to linked data. By integrating all of these technologies, the Datalift platform will provide a complete process for semantic data lifting, with the objective of becoming the reference platform in this area. Finally, Datalift will publish data through a network of data providers, with the objective of building an interlinked Web database large enough to support innovative applications and to encourage other data providers to lift their data.
Scientific and technical bottlenecks
For the Web of data to emerge, methods and tools are needed along the whole semantic lifting process. While some tools exist for individual steps of the process, a comprehensive platform that automates it is still to be developed. Ontology search, ontology quality metrics and ontology similarity measures are three disconnected areas of research. Datalift will tackle the problem of connecting these three areas in order to provide efficient ontology selection methods. As far as data interlinking is concerned, Datalift targets full automation of the process, with expert validation of the results. Analysis of current semi-automatic methods shows three problems to overcome: providing ontology alignments when the ontologies are heterogeneous, determining the key properties that uniquely identify the resources to be matched, and identifying the similarity measures to use when comparing these properties. Attaching licensing and provenance information to enable rights management in RDF requires extending the syntax and semantics of the formalism. It also requires extending query mechanisms so that this information can be retrieved and followed.
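To make the last two interlinking problems concrete, here is a minimal Python sketch of how key properties and a similarity measure could combine to propose candidate links for expert validation. It is an illustration only, not the Datalift method: the property names (`name`, `postal_code`), the example URIs, the threshold, and the choice of `difflib.SequenceMatcher` as the similarity measure are all assumptions, and the sketch presumes that ontology alignment has already mapped both datasets onto the same property names.

```python
from difflib import SequenceMatcher

# Hypothetical key properties assumed to identify a resource (illustrative only).
KEY_PROPERTIES = ("name", "postal_code")


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; difflib's ratio is one possible measure."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def resource_score(left: dict, right: dict) -> float:
    """Average the similarity of two resources over the key properties."""
    scores = [similarity(left[p], right[p]) for p in KEY_PROPERTIES]
    return sum(scores) / len(scores)


def propose_links(source: dict, target: dict, threshold: float = 0.9) -> list:
    """Return candidate (source_uri, target_uri, score) links above the threshold.

    The output is meant to be reviewed by an expert, not published as-is.
    """
    links = []
    for s_uri, s_props in source.items():
        for t_uri, t_props in target.items():
            score = resource_score(s_props, t_props)
            if score >= threshold:
                links.append((s_uri, t_uri, round(score, 3)))
    return links
```

Comparing every pair of resources is quadratic in the dataset sizes, so a realistic tool would first narrow the candidates (for instance by grouping on one key property) before scoring them.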
Datalift will provide a catalog of ontologies that facilitates the data providers’ task of selecting ontologies relevant to the data they publish. This catalog will feature concept search, ontology quality metrics, and ontology similarity metrics. Datalift will also provide a data conversion suite that enables efficient semi-automatic conversion of raw data to RDF. This suite will integrate many data conversion tools and automatically select the relevant tool for the type of data source to convert. Datalift aims to provide a suite of tools for automatic Web data interlinking. Datalift will also conduct a large-scale interlinking experiment with the platform’s content providers’ data and with other datasets. An infrastructure for storing and accessing data will be provided, together with a suite of tools for navigating and interacting with the linked data resulting from the platform’s lifting process. By extending Semantic Web description and querying formalisms to manage licenses and rights, we expect to overcome one of the major obstacles to the publication of data by data providers. Having license information will allow them to retain credit for their data.
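As a rough illustration of the conversion step, the following Python sketch turns a small CSV table into N-Triples lines and attaches a license to the dataset as a whole. It is a simplified sketch under stated assumptions rather than the Datalift conversion suite: the `example.org` namespaces, the function names, and the one-column-per-property mapping are hypothetical, and the dataset-level `dcterms:license` triple is only the simplest existing way to state a license, not the extension of RDF syntax and semantics discussed above.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/resource/"   # hypothetical namespace for lifted resources
VOCAB = "http://example.org/vocab#"     # hypothetical namespace standing in for the selected ontology
DCT_LICENSE = "http://purl.org/dc/terms/license"  # Dublin Core Terms license property


def escape_literal(value: str) -> str:
    """Escape the characters that cannot appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(text: str, id_column: str, dataset_uri: str, license_uri: str) -> list:
    """Convert CSV text to N-Triples lines, starting with a dataset-level license triple.

    Assumption: column names are safe to append to VOCAB as property names.
    """
    lines = [f"<{dataset_uri}> <{DCT_LICENSE}> <{license_uri}> ."]
    for row in csv.DictReader(io.StringIO(text)):
        # Percent-encode the identifier so that it is safe inside an IRI.
        subject = f"<{BASE}{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the identifier column and empty cells
            lines.append(f'{subject} <{VOCAB}{column}> "{escape_literal(value)}" .')
    return lines
```

Each data row yields one triple per non-empty cell, with the row identifier used to mint the subject URI. A real conversion tool would map columns to properties of the ontology chosen by the provider instead of deriving them from the column headers.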