This project transforms tabular input files in CSV format into RDF Turtle. During the transformation, the input files, comprising tables of observations and reference data, are converted into RDF Data Cube resources specified as CSVW.
In order to execute the full table2qb process, the following three pipelines should be run (in no particular order):

- components-pipeline
- codelist-pipeline (run for each codelist CSV input file)
- cube-pipeline
To run the components-pipeline, use the following command:
BASE_URI=your_domain java -jar target/table2qb-0.1.3-SNAPSHOT-standalone.jar exec components-pipeline --input-csv input_file --column-config config_file --output-file output_file
To run the codelist-pipeline for each of the codelist files, use the following command:
BASE_URI=your_domain java -jar target/table2qb-0.1.3-SNAPSHOT-standalone.jar exec codelist-pipeline --codelist-csv codelist_file --codelist-name codelist_name --codelist-slug codelist_slug --column-config config_file --output-file output_file
To run the cube-pipeline, use the following command:
BASE_URI=your_domain java -jar target/table2qb-0.1.3-SNAPSHOT-standalone.jar exec cube-pipeline --input-csv input_file --dataset-name dataset_name --dataset-slug dataset_slug --column-config config_file --output-file output_file
Input CSV file of the correct structure: contents must correspond to the choice of pipeline, as outlined below:

- components-pipeline uses the CSV defining the components (e.g. components.csv) as input_file
- codelist-pipeline uses individual codelist CSV files as input_files
- cube-pipeline uses the input observation data file (e.g. input.csv) as input_file
The config_file (described below) is used to determine how the data in the input_file is interpreted. The config_file (e.g. columns.csv) defines the mapping between a column name and the relevant component, and sets out any preparatory transformations and URI templates.
The output of the process: a single file as RDF in Turtle format.
- codelist_file: the input CSV codelist file.
- codelist_name: the name of the output codelist file in Turtle format. The value of codelist_name is used as the value of the dcterms:title property of the created codelist.
- codelist_slug: the slug of the output codelist file.
- dataset_name: the name of the output file in Turtle format. The value of dataset_name is used as the value of the dcterms:title property of the created dataset.
- dataset_slug: the slug of the output file.
The observation input table should be arranged as tidy data, i.e. one row per observation and one column per component (dimension, attribute or measure). This reshaping should be done prior to running table2qb, using your choice of tool or software.
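As a sketch of what such a reshaping step might look like (pure Python here, though any tool such as R or pandas would do equally well; the column names and values are hypothetical, not taken from the project):

```python
import csv
import io

# Hypothetical "wide" input: one column per year, so NOT tidy.
wide = """Area,2016,2017
North,100,110
South,200,210
"""

# Melt into tidy form: one row per observation, with a single
# Year dimension column and a single Value column.
tidy_rows = []
for row in csv.DictReader(io.StringIO(wide)):
    for year in ("2016", "2017"):
        tidy_rows.append({"Area": row["Area"], "Year": year, "Value": row[year]})

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Area", "Year", "Value"])
writer.writeheader()
writer.writerows(tidy_rows)
print(out.getvalue())
```

The resulting table has four observation rows, one per Area/Year combination.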
The column headers and column contents in the observation input file correspond to the components listed in the column configuration file (e.g. columns.csv). This means that column headers in the input CSV file (e.g. input.csv) should match the values under the "Title" column in the column configuration file (e.g. columns.csv).
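For illustration, if the column configuration file contained rows whose Title values include Geography, Year and Value, then the observation file's header row would need to use exactly those names (all names and values here are hypothetical):

```csv
Geography,Year,Value
north,2016,100
south,2017,210
```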
As mentioned above, components are the dimensions, attributes and measures present in the input observation data. The CSV components file (e.g. components.csv) should contain a list of those components from the input file that you want to define vocabularies for. If external vocabularies are reused for certain components, those components don't need to be listed here.
The CSV file should have the following columns:

- Label: a human readable label that corresponds to the column heading in the input observation data file
- Description: a human readable description of the component
- Component Type: the type of component, i.e. Measure, Dimension or Attribute
- Codelist: a link to the full codelist of all possible codes for a given component
In addition to the list of components, this file should also contain the unique values of the Measure Type dimension. For example, if the Measure Type dimension contains the values "Count" and "Net Mass", then both of them should be added to the components list.
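Putting this together, a hypothetical components.csv fragment might look as follows (the descriptions and codelist URI are illustrative, not taken from the project):

```csv
Label,Description,Component Type,Codelist
Geography,The area the observation covers,Dimension,http://example.org/def/concept-scheme/geography
Count,Number of items observed,Measure,
Net Mass,Mass in tonnes,Measure,
```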
Codelists describe the universal set of codes that may be the object of a component, not just the (sub)set that has been used within a cube. In other words, all the possible codes for a given codelist should be listed in the input CSV file for each codelist.
Each codelist CSV file should have the following columns:

- Label: a code from the codelist; there should be one row per code
- Notation: a slug of Label, e.g. 3 Mineral Fuels becomes 3-mineral-fuels
- Parent Notation: a slug of the parent code for the given code value
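A hypothetical codelist fragment, using the example above (the second row and its parent relationship are illustrative):

```csv
Label,Notation,Parent Notation
3 Mineral Fuels,3-mineral-fuels,
32 Coal and Coke,32-coal-and-coke,3-mineral-fuels
```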
The table2qb pipeline is configured with a CSV file (e.g. columns.csv) describing the components it can expect to find in the observation data file. The location of this file is specified with the --column-config parameter.
The CSV file should have the following columns:

- title: a human readable title that will be provided in the (first) header row of the input
- name: a machine-readable identifier used in URI templates
- component_attachment: how the component in the column should be attached to the Data Structure Definition (i.e. one of qb:dimension, qb:attribute, qb:measure or nil)
- property_template: the predicate used to attach the (cell) values to the observations
- value_template: the URI template applied to the cell values. The argument inside the {} should match the corresponding value of the name column.
- datatype: how the cell value should be parsed (typically string for everything except the value column, which will be number)
- value_transformation: possible values are slugize, unitize or blank. slugize converts column values into URI components, e.g. (slugize "Some column value") is some-column-value. unitize works like slugize and in addition translates the literal £ into GBP, e.g. (unitize "£ 100") is gbp-100.
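A rough sketch of the behaviour described above, written in Python for illustration (this is not table2qb's actual implementation, which is in Clojure):

```python
import re

def slugize(value: str) -> str:
    """Lower-case the value and replace runs of non-alphanumeric
    characters with single hyphens, trimming any at the edges."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")

def unitize(value: str) -> str:
    """Like slugize, but first translate the literal '£' into 'GBP'."""
    return slugize(value.replace("£", "GBP"))

print(slugize("Some column value"))  # some-column-value
print(unitize("£ 100"))              # gbb is wrong; prints gbp-100
```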
The configuration file should contain one row per component, as well as one row per unique value of the Measure Type dimension (as above).
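A hypothetical fragment of such a configuration file (the URIs and values are illustrative only; see the examples directory for real ones):

```csv
title,name,component_attachment,property_template,value_template,datatype,value_transformation
Geography,geography,qb:dimension,http://example.org/def/dimension/geography,http://example.org/def/concept/geography/{geography},string,slugize
Count,count,qb:measure,http://example.org/def/measure/count,,string,
Value,value,,http://example.org/def/measure/{measure_type},,number,
```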
The ./examples/employment directory provides a full example and instructions for running it.
table2qb is written in Clojure, so Leiningen should be installed in order to build and run it.
The following version of Java is recommended: Java 8
Prior to running table2qb for the first time, a standalone jar file should be built by running the following command in the root folder of the project: lein uberjar
Copyright © 2018 Swirrl IT Ltd.
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.