table2qb

This project transforms tables of observations and reference data into RDF Data Cube resources, specified as CSVW.

The pipeline

Observation Data

The observation input table should be arranged as tidy data, i.e. one row per observation and one column per component (dimension, attribute or measure); a hypothetical sketch of such an input follows the list below. The output is a set of CSVW documents - i.e. CSV with JSON-LD metadata - that can be translated into RDF via a csv2rdf processor. The outputs that make up the cube are:

  • observations.csv: this goes through some transformations to standardise the cell values from arbitrary strings to slugs or other notations that are ready to be combined into URIs
  • component-specifications.csv: this is a normalisation of the observations table that has one row per component
  • dataset.json: the qb:DataSet
  • data-structure-definition.json: the qb:DataStructureDefinition
  • component-specifications.json: the set of qb:ComponentSpecifications (one per dimension, attribute and measure in the input)
  • observations.json: the set of qb:Observations (one per row in the input)
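
For illustration, a tidy observations input might look something like the following hypothetical fragment (the headings and values here are invented for this sketch; in practice they must match the titles declared in your columns configuration):

Geography,Date,Flow,SITC Section,Unit,Measure Type,Value
N92000002,2016,Exports,0 Food and live animals,£ million,GBP Total,100
N92000002,2016,Imports,0 Food and live animals,£ million,GBP Total,110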

We also provide a set of skos:ConceptSchemes enumerating all of the codes used in each of the components (via used-codes-codelists.json and used-codes-codes.json). These are useful for navigating within a cube using the marginals; in other words, they save you from having to scan through all of the observations in order to establish the extent of the cube.

Reference Data

The project provides pipelines for preparing reference data. These can be used for managing reference data across multiple qb:DataSets.

  • Components: given a tidy-data input of one component per row, this pipeline creates a components.csv file and a components.json for creating qb:ComponentProperty resources in an owl:Ontology. Note that the components are the dimensions, attributes and measures themselves, whereas the component-specifications are what link these to a given data-structure-definition. A hypothetical sketch of the inputs follows this list.
  • Codelists: given a tidy-data input of one code per row, this pipeline creates a codelist.csv file and a codelist.json for creating skos:Concepts in a skos:ConceptScheme. Note that these codelists describe the universal set of codes that may be the object of a component (making it a qb:CodedProperty), not the (sub)set that have been used within a cube.
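
As a sketch, the tidy inputs to these pipelines might look something like the following hypothetical fragments (the headings are invented for illustration; see the files under ./test/resources/trade-example/ for the actual layout used by the demo):

Components input (one component per row):
Label,Description,Component Type,Codelist
Flow,Direction of trade,Dimension,http://example.org/def/concept-scheme/flow-directions

Codelist input (one code per row):
Label,Notation
Exports,exports
Imports,imports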

Configuration

The table2qb pipeline is configured with a dataset describing the columns it can expect to find in the input CSV files. We currently provide this as a columns.csv file, although we may later store/retrieve it from the database.

The dataset should have the following columns (a hypothetical example follows the list):

  • title - a human readable title (like csvw:title) that will be provided in the (first) header row of the input
  • name - a machine-readable identifier (like csvw:name) used in uri templates
  • component_attachment - how the component in the column should be attached to the Data Structure Definition (i.e. one of qb:dimension, qb:attribute, qb:measure or nil)
  • property_template - the predicate used to attach the (cell) values to the observations
  • value_template - the URI template applied to the cell values
  • datatype - as per csvw:datatype, how the cell value should be parsed (typically string for everything except the value column which will be number)
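
For example, a couple of hypothetical rows of such a configuration might look like this (the example.org URIs are placeholders, not real vocabulary URIs):

title,name,component_attachment,property_template,value_template,datatype
Flow,flow,qb:dimension,http://example.org/def/dimension/flow,http://example.org/def/concept/flow-directions/{flow},string
Value,value,,http://example.org/def/measure/value,,number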

This initial draft also includes several conventions in the code that ought to be generalised to configuration - particularly how cell values are slugged.
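
To illustrate the kind of slugging in question, here is a minimal sketch of a slugizing function; it is an assumption for illustration only - the actual convention lives in the pipeline code and may differ:

(require '[clojure.string :as string])

;; Hypothetical sketch: lower-case a label and collapse runs of
;; non-alphanumeric characters into hyphens so it is safe to use in a URI.
(defn slugize [s]
  (-> s
      string/trim
      string/lower-case
      (string/replace #"[^a-z0-9]+" "-")
      (string/replace #"(^-)|(-$)" "")))

(slugize "Measurement Units") ;;=> "measurement-units"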

Example

The ./test/resources/ directory provides examples of the above inputs.

The example is also premised on the configuration in ./resources/columns.csv. This will need changing to support further examples; ultimately the configuration should be extracted from the database, so that adding components makes them available as columns that can be provided in observation CSV.

You can get the demo working from the repl:

$ lein repl
(require 'table2qb.core)
(in-ns 'table2qb.core)

To serialise everything to a tmp directory, call (serialise-demo). Alternatively you can go through the pieces one at a time...

Build components ontology:

(components-pipeline
 "./test/resources/trade-example/components.csv"
 "./tmp")
 ;; => components.csv, components.json

Build codelists:

(codelist-pipeline
 "./test/resources/trade-example/flow-directions.csv"
 "./tmp" "Flow Directions" "flow-directions")
 ;; => flow-directions.csv, flow-directions.json
 
(codelist-pipeline
 "./test/resources/trade-example/sitc-sections.csv"
 "./tmp" "SITC Sections" "sitc-sections")
 ;; => sitc-sections.csv, sitc-sections.json

(codelist-pipeline
 "./test/resources/trade-example/units.csv"
 "./tmp" "Measurement Units" "measurement-units")
 ;; => measurement-units.csv, measurement-units.json

Build the cube itself:

(data-pipeline
 "./test/resources/trade-example/input.csv"
 "./tmp" "Regional Trade" "regional-trade")
 ;; => component-specifications.csv, dataset.json, data-structure-definition.json, component-specifications.json
 ;; => observations.csv, observations.json, used-codes-codelists.json, used-codes-codes.json

Ultimately we'll translate this into linked data using the csv2rdf library. For now there are some helper functions that call the RDF::Tabular csv2rdf translator via the rdf CLI tool (you can get this with gem install linkeddata).

For the metadata (each should be loaded into PMD as a vocabulary):

(csv2rdf "./tmp" "components") ;;=> components.ttl
(csv2rdf "./tmp" "flow-directions") ;;=> flow-directions.ttl
(csv2rdf "./tmp" "sitc-sections") ;;=> sitc-sections.ttl
(csv2rdf "./tmp" "measurement-units") ;;=> measurement-units.ttl

For the cube (each can be loaded into one PMD Dataset that covers the whole cube):

(csv2rdf-qb "./tmp")
;;=> dataset.ttl, data-structure-definition.ttl, component-specifications.ttl
;;=> observations.ttl, used-codes-codelists.ttl, used-codes-codes.ttl

You'll also need the RDF Cube, SDMX, time and UK geo reference vocabularies (from here) to make this work.

License

Copyright © 2018 Swirrl IT Ltd.

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.