Ro-Crate RDA maDMP Mapper is a tool that defines a mapping between RO-Crate and RDA maDMP. It can generate one or more RO-Crates from a single maDMP (1 to many) and a single maDMP from one or more RO-Crates (many to 1).
A DMP can include multiple datasets, whereas an RO-Crate has only one root dataset, which can in turn contain nested datasets. It is assumed that most of the data attached to the root dataset (e.g. funding, contributors, authors) is equivalent to the data of the DMP. Nested datasets are included as elements of distributions.
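
The one-to-many direction can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: the function name split_dmp_into_roots and the simplified output are hypothetical, and only the maDMP keys (dmp, dataset, distribution, title, contact, contact_id) come from the RDA standard.

```python
from __future__ import annotations

from typing import Any


def split_dmp_into_roots(madmp: dict[str, Any]) -> list[dict[str, Any]]:
    """Build one simplified RO-Crate root dataset per dataset of an maDMP.

    Hypothetical helper: a real crate would carry many more attributes.
    """
    dmp = madmp["dmp"]
    contact_id = dmp.get("contact", {}).get("contact_id", {}).get("identifier")
    roots = []
    for dataset in dmp.get("dataset", []):
        root: dict[str, Any] = {
            "@id": "./",
            "@type": "Dataset",
            "name": dataset.get("title"),
            # Each distribution becomes a nested part of the root dataset.
            "hasPart": [
                {"@id": dist["title"]}
                for dist in dataset.get("distribution", [])
            ],
        }
        if contact_id:
            # DMP-level data is assumed to describe the root dataset as well.
            root["contactPoint"] = {"@id": contact_id}
        roots.append(root)
    return roots


# Made-up example input with two datasets.
madmp = {
    "dmp": {
        "title": "Example DMP",
        "contact": {
            "contact_id": {"identifier": "https://example.org/people/jane", "type": "other"},
            "name": "Jane Doe",
            "mbox": "jane@example.org",
        },
        "dataset": [
            {"title": "Raw data", "distribution": [{"title": "raw.csv"}]},
            {"title": "Results", "distribution": []},
        ],
    }
}
roots = split_dmp_into_roots(madmp)
```

Each element of roots would become the root dataset of its own RO-Crate, which is why a single maDMP can yield several crates.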
RO-Crates must have an @id for each property; otherwise, the generated JSON-LD would be malformed. However, DMP entities do not always have an identifier. When an identifier is absent, the title is used as the @id.
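
The fallback rule can be illustrated with a small sketch. The helper name entity_id is hypothetical and is not taken from the tool; the nested identifier objects (for example dataset_id with an identifier key) follow the maDMP structure.

```python
def entity_id(entity: dict, id_key: str) -> str:
    """Return the @id for a DMP entity: its identifier, or else its title."""
    identifier = (entity.get(id_key) or {}).get("identifier")
    if identifier:
        return identifier
    # No identifier available: fall back to the title.
    return entity["title"]


with_id = {
    "title": "Raw data",
    "dataset_id": {"identifier": "https://example.org/datasets/raw", "type": "url"},
}
without_id = {"title": "Raw data"}

id_from_identifier = entity_id(with_id, "dataset_id")
id_from_title = entity_id(without_id, "dataset_id")
```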
The RO-Crate website explicitly mentions that schema.org metadata can be used to supplement RO-Crate. Attributes that are present in the DMP but missing in RO-Crate are therefore covered by other Linked Data vocabularies (for example, cost).
ro-crate attribute | @type of ro-crate property | DMP attribute | Parent of DMP attribute | Comment/Assumption |
---|---|---|---|---|
contactPoint | ContactPoint | contact | dmp | It is assumed that the contact person for the DMP is the same as the contact person of the dataset. |
@id | ContactPoint | identifier | contact_id | |
| | ContactPoint | mbox | contact | |
name | ContactPoint | name | contact | |
contactType | ContactPoint | type | contact | These attributes are not exactly equivalent, but they are close enough. |
author/creator | Dataset | contributor | dmp | |
@id | person | identifier | contributor_id | |
affiliation | person | role | contributor | |
name | person | name | contributor | |
| | person | mbox | contributor | |
cost | Dataset | cost | dmp | In DMP, the cost represents a list of costs related to data management. However, the cost for ro-crate may not include all costs. |
costCurrency | cost | currency_code | cost | This is not explicitly mentioned on the RO-Crate website, but cost properties can be found in the JSON-LD context used for RO-Crates. |
description | cost | description | cost | |
@id | cost | title | cost | |
value | cost | value | cost | |
Language | Dataset | language | dmp | |
description | Dataset | description | dataset | |
name | Dataset | title | dataset | |
identifier | Dataset | identifier | dataset_id | |
Dataset | hasPart | distribution | dataset | RO-Crates usually include many sub-datasets, which usually carry information about the main dataset such as encoding and size. They are therefore treated as distributions. |
File | hasPart | distribution | dataset | RO-Crates usually include many sub-files, and sub-datasets have sub-files of their own. Deeply nested files are not taken into consideration because they do not align with the concept of a distribution. |
downloadUrl | DataDownload | download_url | distribution | |
contentUrl | DataDownload | access_url | distribution | |
endDate | DataDownload | available_until | distribution | |
contentSize | Dataset | byte_size | distribution | |
contentSize | Dataset/File | byte_size | distribution | |
encodingFormat | Dataset/File | format | distribution | |
name | contentLocation | geo_location | host | It is assumed that the name is a country. This is not always the case, but it is better than losing the information. |
title | RepositoryObject | title | host | |
description | RepositoryObject | description | host | |
@id | RepositoryObject | url | host | |
availability | RepositoryObject | availability | host | The format of the availability value is not defined in exactly the same way in RO-Crate and DMP. |
license | Dataset/File | license | distribution | |
@id | license | license_ref | license | |
identifier | license | license_ref | license | |
datePublished | Dataset | issued | dataset | |
keywords | Dataset | keyword | dataset | |
description | Organisation | description | project | |
name | Organisation | title | project | |
endDate | Organisation | end | project | |
startDate | Organisation | start | project | |
@id | Grant | identifier | funder_id | |
@id | CreativeWork | identifier | metadata_standard_id | |
Language | CreativeWork | language | dmp | metadata |
description | CreativeWork | description | dmp | metadata |
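
As an illustration of how rows of this table could be applied in both directions, the sketch below handles only the cost rows. The names COST_FIELDS, cost_to_entity and entity_to_cost are hypothetical and do not come from the tool; the @type of the entity is deliberately left out because the table does not pin it down.

```python
# (RO-Crate key, maDMP key) pairs copied from the cost rows of the table.
COST_FIELDS = [
    ("costCurrency", "currency_code"),
    ("description", "description"),
    ("value", "value"),
]


def cost_to_entity(cost: dict) -> dict:
    """Map one maDMP cost object to a simplified RO-Crate entity."""
    entity = {"@id": cost["title"]}  # the title doubles as the @id
    for crate_key, dmp_key in COST_FIELDS:
        if dmp_key in cost:
            entity[crate_key] = cost[dmp_key]
    return entity


def entity_to_cost(entity: dict) -> dict:
    """Map a simplified RO-Crate cost entity back to an maDMP cost object."""
    cost = {"title": entity["@id"]}
    for crate_key, dmp_key in COST_FIELDS:
        if crate_key in entity:
            cost[dmp_key] = entity[crate_key]
    return cost


# Made-up example values.
dmp_cost = {"title": "Storage", "currency_code": "EUR", "value": 1000}
crate_cost = cost_to_entity(dmp_cost)
round_trip = entity_to_cost(crate_cost)
```

Keeping the attribute pairs in one list means both directions read from the same source, so a row added to the mapping only has to be declared once.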

The following DMP attributes have no direct equivalent in RO-Crate:

DMP attribute | Comment |
---|---|
dmp | There is no direct mapping to dmp, since RO-Crate is an approach to packaging research data with their metadata, whereas the DMP concept considers the bigger picture. |
contact_id, contributor_id, dmp_id, dataset_id, hasPart, host, project, metadata_standard_id, funder_id | These attributes have no equivalent attribute, but their children do. The equivalence is not needed, so they are not classified as missing. |
type | The "type" of "contact_id" and "contributor_id" has a different meaning than "@type". There is no equivalent. |
created | The DMP creation date should be the time at which the DMP is generated. Therefore, it has no equivalent in RO-Crates. |
modified | In this case, the modified date is the same as the DMP creation date. |
personal_data, preservation_statement, security_and_privacy, data_access, storage_type, pid_system, certified_with, backup_type, backup_frequency, ethical_issues_description, ethical_issues_exist, ethical_issues_report, data_quality_assurance, sensitive_data | Almost all attributes related to quality, privacy, ethics and security are missing in RO-Crate and cannot be translated. Note: accessMode in https://w3id.org/ro/crate/1.0/context is different from data_access. |
support_versioning | This specific attribute has no equivalent in the context. However, specific versions can be included using other attributes. |
funding_status | The definition provided in the official DMP description does not match any of the descriptions in RO-Crate. |
identifier, description, title | These attributes have to be provided by the user. It is wrong to assume that the DMP's identifier is the same as the dataset's identifier, so the user should create a new identifier for the DMP. It is also not always logical to give the DMP the name of the dataset. |

The following RO-Crate attributes have no direct equivalent in DMP:

ro-crate attribute | Comment |
---|---|
@context, @graph, conformsTo, about | These attributes have no equivalent because they are specific to JSON-LD, whereas DMP follows JSON Schema rather than JSON-LD. When an RO-Crate is generated automatically, they are inserted, since they can be determined programmatically. When generating DMPs, they are ignored. |
hasPart, hasMember | DMPs do not have an equivalent of these attributes. However, they are accounted for by including their values in distributions. |
publisher, sameAs, temporalCoverage | These attributes are not explicitly included in DMPs. |
CreateAction, UpdateAction | The CreateAction and UpdateAction classes model the contributions of Context Entities of type Person or Organization. This is not present in DMPs. |
latitude, longitude | Places are described more thoroughly in RO-Crates. |
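
The JSON-LD-specific attributes (@context, @graph, conformsTo, about) can be filled in without any input from the DMP. The sketch below assumes RO-Crate 1.0, whose context URL is mentioned earlier in this document; the function name crate_skeleton is hypothetical, and the tool may build this structure differently.

```python
import json

# RO-Crate 1.0 identifiers; other RO-Crate versions use different values.
CONTEXT = "https://w3id.org/ro/crate/1.0/context"
PROFILE = "https://w3id.org/ro/crate/1.0"


def crate_skeleton(root_name: str) -> dict:
    """Return a minimal RO-Crate 1.0 document with an empty root dataset."""
    return {
        "@context": CONTEXT,
        "@graph": [
            {
                # Metadata file descriptor: points at the root dataset.
                "@id": "ro-crate-metadata.jsonld",
                "@type": "CreativeWork",
                "conformsTo": {"@id": PROFILE},
                "about": {"@id": "./"},
            },
            # Root dataset: mapped DMP attributes would be added here.
            {"@id": "./", "@type": "Dataset", "name": root_name},
        ],
    }


skeleton = crate_skeleton("Example dataset")
serialized = json.dumps(skeleton, indent=2)
```

In the opposite direction these four attributes carry no DMP information, which is consistent with ignoring them when a DMP is generated.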
In order to map between both standards, the following command should be used:
python mapper.py -i <input_path> -o <output_path>
Here, input_path is the path to the input folder (or file) representing an maDMP or an RO-Crate project, and output_path is the path where the generated files will be stored.
The mapping direction (maDMP to RO-Crate or RO-Crate to maDMP) does not need to be defined explicitly, since the tool handles this itself.
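
How the tool detects the direction is not described here. One plausible approach, shown purely as an assumption, is to inspect the top-level keys of the parsed JSON: an maDMP has a single top-level dmp key, while RO-Crate metadata is JSON-LD with @context and @graph. The function name detect_kind is hypothetical.

```python
import json


def detect_kind(document: dict) -> str:
    """Guess whether a parsed JSON document is an maDMP or RO-Crate metadata."""
    if "dmp" in document:
        return "madmp"
    if "@context" in document and "@graph" in document:
        return "rocrate"
    raise ValueError("Input is neither an maDMP nor RO-Crate metadata")


madmp_kind = detect_kind(json.loads('{"dmp": {"title": "Example DMP"}}'))
rocrate_kind = detect_kind(json.loads('{"@context": "https://w3id.org/ro/crate/1.0/context", "@graph": []}'))
```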
Several maDMP and RO-Crate examples are also provided. They are structured in the following way:
├── madmp
│   ├── calculation-of-nice-sunny-days
│   ├── closed
│   ├── dataset-many
│   ├── funded-project
│   ├── life-expectancy-prediction
│   ├── long
│   ├── minimal-content
│   ├── multilayer-perceptron-on-hypothyroid
│   ├── swedish-motor-insurance
│   └── world-development-indicators
└── rocrate
    ├── drug_consumption
    ├── Glop_Pot
    ├── GTM
    ├── NursingResidentStuff
    └── world_development_indicators_visualization
For maDMPs, 5 of the examples were created by students as part of Data Stewardship at Vienna University of Technology. The other 5 are taken from the examples provided by the RDA-DMP-Common-Standard.
For RO-Crates, 3 examples are taken from the official RO-Crate website; the other 2 were created specifically to test the coverage of the mapping using existing datasets.
A demo of the mapping between both standards can be executed using the following command:
python demo.py
This demo uses the content of the examples folder to generate mapped files and stores them in demo_files.
The unit tests stored under tests can be run with pytest. To do so, navigate to the tests folder and run pytest -vv.
The unit tests cover the functionality of some methods within the different modules; they do not test the mapping functionality itself.