A tool designed to convert IMOS NetCDF and CSV files into Cloud Optimised formats such as Zarr and Parquet.
Visit the documentation on ReadTheDocs for detailed information.
- Conversion of CSV/NetCDF files to Cloud Optimised formats (Zarr/Parquet)
- YAML configuration approach, with parent and child YAML configurations when multiple datasets are very similar (e.g. Radar ACORN, GHRSST; see config) — a config merge sketch follows this list
- Generic handlers for most datasets (GenericParquetHandler, GenericZarrHandler)
- Specific handlers can be written that inherit methods from a generic handler (e.g. the Argo handler, Mooring Timeseries handler)
- Clustering capability:
  - Local Dask cluster
  - Remote Coiled cluster
  - Driven by configuration and easily overridden
  - Zarr: gridded datasets are processed in batches and in parallel with xarray.open_mfdataset (see the batch-open sketch below)
  - Parquet: tabular files are processed in batches and in parallel as independent tasks using futures (see the futures sketch below)
- Reprocessing:
  - Zarr: reprocessing is achieved by writing to specific regions with slices; non-contiguous regions are handled (see the region-write sketch below)
  - Parquet: reprocessing is done via pyarrow's internal overwrite mechanism, but can also be forced when an input file has changed significantly
- Chunking:
  - Parquet: to facilitate querying of geospatial data, polygon and timestamp slices are created as partitions (see the partitioning sketch below)
  - Zarr: chunking is set via the dataset configuration
- Metadata:
  - Parquet: metadata is written as a sidecar _metadata.parquet file (see the sidecar sketch below)
- Unit testing of modules: very close to integration testing; a local cluster is used to create Cloud Optimised files
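The sketches below illustrate a few of these features with plain Python; all paths, names, and parameters are illustrative, not the library's actual API. First, the parent/child configuration idea: a child YAML for one dataset overlays shared defaults from a parent. The file names and merge helper here are hypothetical.

```python
# Minimal sketch of a parent/child YAML overlay; file names are hypothetical.
import yaml

def merge_config(parent: dict, child: dict) -> dict:
    """Recursively overlay child keys on top of parent defaults."""
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

with open("radar_parent.yaml") as f:
    parent = yaml.safe_load(f)   # settings shared by all similar datasets
with open("radar_site.yaml") as f:
    child = yaml.safe_load(f)    # overrides for one site/product

config = merge_config(parent, child)
```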
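For the gridded (Zarr) path, a batch conversion boils down to something like the following: open many NetCDF files in parallel on a Dask cluster and write a single Zarr store. The glob pattern, the TIME dimension name, and the chunk size are assumptions.

```python
# Sketch only: batch-open gridded NetCDF files and write one Zarr store.
import xarray as xr
from dask.distributed import Client

client = Client()  # local Dask cluster; a Coiled cluster could be used instead

# Each file open becomes a Dask task; files are combined along coordinates
ds = xr.open_mfdataset("data/*.nc", combine="by_coords", parallel=True)

# Rechunk (size is illustrative) and write the whole grid in parallel
ds = ds.chunk({"TIME": 100})
ds.to_zarr("dataset.zarr", mode="w")
```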
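For the tabular (Parquet) path, each input file is an independent task submitted to the cluster as a future. A minimal sketch, assuming a hypothetical convert_one helper and local CSV files:

```python
# Sketch only: convert tabular files in parallel as independent futures.
import pandas as pd
from dask.distributed import Client, as_completed

def convert_one(path: str) -> str:
    """Hypothetical per-file conversion: CSV in, Parquet out."""
    df = pd.read_csv(path)
    out = path.replace(".csv", ".parquet")
    df.to_parquet(out)
    return out

client = Client()
futures = client.map(convert_one, ["a.csv", "b.csv", "c.csv"])
for future in as_completed(futures):   # handle results as tasks finish
    print("wrote", future.result())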
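Region-based Zarr reprocessing looks roughly like this: load the affected slice, fix it, then write it back to the same indices. The store path, the TIME dimension, and the index range are assumptions for illustration.

```python
# Sketch only: overwrite one time slice of an existing Zarr store.
import xarray as xr

store = "dataset.zarr"
patch = xr.open_zarr(store).isel(TIME=slice(100, 200)).load()

# ... reprocess `patch` here ...

# Region writes only accept variables that vary along the region dimension,
# so drop scalars/other coords and the TIME index itself before writing
drop = [v for v in patch.variables if "TIME" not in patch[v].dims] + ["TIME"]
patch = patch.drop_vars(drop, errors="ignore")

patch.to_zarr(store, region={"TIME": slice(100, 200)})
```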
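The partitioning scheme can be pictured with plain pyarrow: partition columns become directories, so queries filtering on a time slice or spatial cell only touch matching files. The column names and values below are assumptions, not the library's exact scheme.

```python
# Sketch only: hive-style partitions on timestamp and polygon columns.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "timestamp": [1704067200, 1704067200, 1706745600],  # time-slice key (assumed)
    "polygon": ["cell_a", "cell_a", "cell_b"],           # spatial-cell key (assumed)
    "TEMP": [18.2, 18.4, 17.9],
})

# Files land under dataset.parquet/timestamp=.../polygon=.../
pq.write_to_dataset(table, root_path="dataset.parquet",
                    partition_cols=["timestamp", "polygon"])
```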
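Because dataset-level metadata lives in a sidecar Parquet file, it can be inspected like any other table; the path below is illustrative.

```python
# Sketch only: read the sidecar metadata table (path is illustrative).
import pandas as pd

meta = pd.read_parquet("dataset.parquet/_metadata.parquet")
print(meta)
```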
Requirements:
- Python >= 3.10.14
- AWS SSO to push files to S3
- An account on Coiled for remote clustering (optional)
```bash
curl -s https://raw.githubusercontent.com/aodn/aodn_cloud_optimised/main/install.sh | bash
```
Otherwise, go to the release page.
Notebooks can be imported directly into Google Colab.
You can also click the Binder button below to spin up the environment and execute the notebooks (note that Binder is free but has limited resources).