RDF Schemas for Phenopackets Version 2

In this repository, SHACL is used to specify the structure of RDF datasets that comply with GA4GH Phenopackets Version 2 and are interoperable with the CARE-SM semantic model. The SHACL files are based on the ShEx (Shape Expressions) files that represent this RDF version of the Phenopackets Version 2 data model, found here.

Common predicates are represented with terms from the Semanticscience Integrated Ontology (SIO). For some common predicates, and for predicates that originate from biomedical domain knowledge, terms from other ontologies are used, such as the NCI Thesaurus, the Human Phenotype Ontology and the Genotype Ontology. These ontologies are interoperable through the Open Biological and Biomedical Ontology (OBO) Foundry.

Predicates related to metadata are represented with terms from the DCMI Metadata Terms (dcterms), maintained by the Dublin Core Metadata Initiative.

Workflow of generating RDF data from any given data in another format

To avoid manually converting JSON data from scratch into an RDF knowledge graph that complies with the GA4GH Phenopackets schema, a workflow has been set up to facilitate this conversion, as shown below. The workflow consists of multiple steps, which are discussed in the following subsections.

Workflow

1. Modelling Phenopackets Schema

To model the Phenopackets schema, Shapes Constraint Language (SHACL) files have been written that describe how a dataset needs to be structured. To keep the constraints organized, each file stores the shape of a single class or the shapes of similar classes. Shapes are the rules to which the instances of a class need to conform.

2. Generating YARRRML and example JSON file(s) given SHACL files

A script has been developed that generates a YARRRML file containing the maximum requirements of the data structure following the Phenopackets schema. The generator script can be found in folder shacl2yarrrrml. Alongside the YARRRML file, one or more JSON files are created that show what the data to be converted should look like; multiple "root" node shapes defined in SHACL result in multiple JSON files.

3. Aligning data to JSON file

The JSON file(s) show(s) the structure that the YARRRML file will accept and convert correctly to RDF. This structure is mainly built upon indexes that link one data field to another, which keeps the RDF conversion robust regardless of the SHACL structure that has been given. Each data field in the generated JSON file(s) contains a comment that tells the user whether the data field is needed in order to comply with the data model represented by the SHACL files.

In folder phenopacketv2_jsonaligner a script can be found that aligns JSON data following the phenopacket version 2 structure to the JSON data structure that is compatible with the generated YARRRML.
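The idea of index-based linking can be sketched as follows, using only the Python standard library. This is a minimal illustration, not the aligner script itself: the input uses real Phenopackets Version 2 field names (`id`, `subject`, `phenotypicFeatures`), but the output layout, the function `align` and the keys `index` and `phenopacket_index` are assumptions chosen for this example.

```python
import json


def align(phenopacket, packet_index=0):
    """Flatten a nested phenopacket into index-linked records (hypothetical layout)."""
    aligned = {
        "phenopacket": [{"index": packet_index, "id": phenopacket["id"]}],
        "individual": [],
        "phenotypic_feature": [],
    }

    subject = phenopacket.get("subject")
    if subject is not None:
        aligned["individual"].append({
            "index": 0,
            # Link back to the parent record by index instead of by nesting.
            "phenopacket_index": packet_index,
            "id": subject["id"],
            "sex": subject.get("sex"),
        })

    for i, feature in enumerate(phenopacket.get("phenotypicFeatures", [])):
        aligned["phenotypic_feature"].append({
            "index": i,
            "phenopacket_index": packet_index,
            "type_id": feature["type"]["id"],
            "type_label": feature["type"].get("label"),
        })

    return aligned


example = {
    "id": "packet-1",
    "subject": {"id": "patient-1", "sex": "FEMALE"},
    "phenotypicFeatures": [
        {"type": {"id": "HP:0001250", "label": "Seizure"}},
        {"type": {"id": "HP:0001263", "label": "Global developmental delay"}},
    ],
}

aligned = align(example)
print(json.dumps(aligned, indent=2))
```

Because every record refers to its parent through an index field, a mapping rule can join records without depending on how deeply the original data was nested.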

4. RML Mapping

The browser-based IDE Matey can be used to generate RDF triples from your data and YARRRML rules. To automate this process, folder yarrrml2rdf contains a script that opens the browser-based IDE, enters the input data files and the generated YARRRML rules, and generates the RDF triples, which are stored in a Turtle file.

5. RDF Validation

To ensure that the resulting RDF knowledge graph still complies with the Phenopackets data model, a script has been written that validates a given RDF file against the SHACL files containing all the class shapes. When the RDF data does not conform to the intended data structure, the script outputs a report on which instances cause violations. The validator script can be found in folder rdfvalidator.

Used Libraries

The scripts in this workflow use functionality from the open-source Python library RDFLib and the additional module pySHACL. The package ruamel.yaml is used for writing the YARRRML template, and the package selenium is used for automating a series of actions in the browser-based IDE Matey.

Used Phenopacket Data

In folder example-hamlet mock HAMLET analysis data has been converted to an RDF dataset.

In folder example-phenopacket a selection of phenopacket instances has been extracted from the Monarch Initiative phenopacket store and converted to RDF datasets.
