In this repository, SHACL is used to specify the structure of RDF datasets that comply with GA4GH Phenopackets Version 2 and are interoperable with the CARE-SM semantic model. The SHACL files are based on the ShEx (Shape Expressions) files representing this RDFied Phenopackets version 2 data model found here.
To represent common predicates, terms from the Semanticscience Integrated Ontology (SIO) are used. For some common predicates and for those originating from biomedical domain knowledge, terms from other ontologies have been added, such as the NCI Thesaurus, the Human Phenotype Ontology and the Genotype Ontology. These ontologies are interoperable through the Open Biological and Biomedical Ontology (OBO) Foundry.
For predicates related to metadata, the terms defined in the DCMI Metadata Terms (dcterms), maintained by the Dublin Core Metadata Initiative, are used.
To avoid manually converting JSON data from scratch into an RDF knowledge graph that complies with the GA4GH Phenopackets schema, a workflow has been set up that facilitates this conversion, as shown below. The workflow consists of multiple steps that are discussed in the next subsections.
To model the Phenopackets schema, Shapes Constraint Language (SHACL) files have been written that describe how a dataset needs to be structured. To keep the constraints organized, each file stores the shape of a single class or the shapes of similar classes. Shapes are the rules to which the instances of a class need to conform.
A script has been developed that generates a YARRRML file containing the maximum requirements of the data structure following the Phenopackets schema. The generator script can be found in folder shacl2yarrrrml. Alongside the YARRRML file, one or more JSON files are created that show what the data to be converted should look like; multiple "root" node shapes defined in SHACL result in multiple JSON files.
The JSON file(s) show the structure that the YARRRML file will accept and convert correctly to RDF. This structure is mainly built upon indexes that link one data field to another, which keeps the RDF conversion robust regardless of the SHACL structure given. Each data field in the generated JSON file(s) contains a comment that tells the user whether the data field is needed to comply with the data model represented by the SHACL files.
In folder phenopacketv2_jsonaligner, a script can be found that aligns JSON data following the phenopacket version 2 structure to the JSON data structure that is compatible with the generated YARRRML.
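The idea of an index-linked JSON structure can be illustrated with a minimal sketch. It is not the repository's aligner: the output field names (`index`, `subject_index`, `subjects`, `phenotypic_features`) and the function `align` are hypothetical, chosen only to show how nested phenopacket data could be flattened into records that refer to each other by index. The actual layout is defined by the JSON files produced by the generator script.

```python
import json


def align(phenopacket: dict) -> dict:
    """Flatten a nested phenopacket-like dict into index-linked record lists.

    Hypothetical output layout, for illustration only.
    """
    subjects = []
    features = []

    # One subject record per phenopacket; its index is what other records point to.
    subject = phenopacket.get("subject", {})
    subject_index = len(subjects)
    subjects.append(
        {"index": subject_index, "id": subject.get("id"), "sex": subject.get("sex")}
    )

    # Each phenotypic feature becomes its own record, linked back by subject_index.
    for feature in phenopacket.get("phenotypicFeatures", []):
        term = feature.get("type", {})
        features.append(
            {
                "index": len(features),
                "subject_index": subject_index,
                "type_id": term.get("id"),
                "type_label": term.get("label"),
            }
        )

    return {"subjects": subjects, "phenotypic_features": features}


nested = {
    "id": "example-1",
    "subject": {"id": "patient-1", "sex": "FEMALE"},
    "phenotypicFeatures": [
        {"type": {"id": "HP:0001250", "label": "Seizure"}},
        {"type": {"id": "HP:0001263", "label": "Global developmental delay"}},
    ],
}

aligned = align(nested)
print(json.dumps(aligned, indent=2))
```

Because every record carries an explicit index, a mapping rule can join records on those indexes without depending on how deeply the original JSON was nested.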
The browser-based IDE Matey can be used to generate RDF triples from your data and YARRRML rules. To automate this process, folder yarrrml2rdf contains a script that opens the browser-based IDE, enters the input data files and the generated YARRRML rules, and generates the RDF triples, which are then stored in a Turtle file.
To ensure that the resulting RDF knowledge graph still complies with the Phenopackets data model, a script has been written that validates a given RDF file against the SHACL files containing all the class shapes. When the RDF data does not conform to the intended data structure, the script outputs a report on which instances raise errors. The validator script can be found in folder rdfvalidator.
The scripts throughout the workflow use the open source Python library RDFLib and the additional module pySHACL. The package ruamel.yaml is used for writing the YARRRML template, and the package selenium is used for automating a series of actions in the browser-based IDE Matey.
In folder example-hamlet, mock HAMLET analysis data has been converted to an RDF dataset.
In folder example-phenopacket, a selection of phenopacket instances has been extracted from the Monarch Initiative phenopacket store and converted to RDF datasets.