A journey through large climate files
ar - An ML-ready, unified (surface & atmospheric) version of the data in Zarr. (Analysis Ready)
co - A port of Gaussian-gridded ERA5 data to Zarr. (Cloud Optimized)
raw - All raw GRIB & NetCDF data.
raven - Development of an ERA5 cloud storage for efficient access #396
https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11
https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935
https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685
https://medium.com/pangeo/using-kerchunk-with-uncompressed-netcdf-64-bit-offset-files-cloud-optimized-access-to-hycom-ocean-9008ba6d0d67
https://www.coiled.io/
XARRAY
DASK
Parallel Programming in Climate and Weather
climtas: Climate Timeseries Analysis
Climtas is a package for working with large climate analyses. It focuses on the time domain, with custom functions for Xarray and Dask data.
grid metrics (vorticity, divergence, etc.)
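For the grid metrics above, MetPy can compute vorticity and divergence directly from Xarray wind components. A minimal sketch, assuming a hypothetical winds.nc whose u/v variables carry CF coordinates and units attributes:

```python
import xarray as xr
import metpy.calc as mpcalc

# Hypothetical file with u/v wind components and CF metadata.
ds = xr.open_dataset("winds.nc").metpy.parse_cf()

# MetPy infers grid spacing from the latitude/longitude coordinates,
# provided the variables have units attributes.
vort = mpcalc.vorticity(ds["u"], ds["v"])
div = mpcalc.divergence(ds["u"], ds["v"])
```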
Parallelizing Xarray with Dask
Unidata Chunking Data: Why it Matters
With a conventional contiguous (index-order) storage layout, the time dimension varies most slowly, y varies faster, and x varies fastest. In this case, the spatial access is fast (0.013 sec) and the time series access is slow (180 sec, which is 14,000 times slower). If we instead want the time series to be quick, we can reorganize the data so x or y is the most slowly varying dimension and time varies fastest, resulting in fast time-series access (0.012 sec) and slow spatial access (200 sec, 17,000 times slower). In either case, the slow access is so slow that it makes the data essentially inaccessible for all practical purposes, e.g. in analysis or visualization.
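The compromise the Unidata post builds toward is chunked storage: pick a chunk shape that makes both access patterns tolerably fast, instead of one instant and one unusable. A minimal sketch with the netCDF4 Python library (file name, dimension sizes, and chunk shape are illustrative assumptions, not recommendations):

```python
import numpy as np
import netCDF4

# Hypothetical file with a (time, y, x) temperature variable.
nc = netCDF4.Dataset("example.nc", "w")
nc.createDimension("time", None)
nc.createDimension("y", 720)
nc.createDimension("x", 1440)

# Chunked rather than contiguous storage: each chunk covers many time
# steps of a modest spatial tile, so neither time-series reads nor
# map reads have to sweep the whole file.
t2m = nc.createVariable(
    "t2m", "f4", ("time", "y", "x"),
    chunksizes=(256, 36, 72),  # assumption: tune for your access patterns
)
t2m[0, :, :] = np.zeros((720, 1440), dtype="f4")
nc.close()
```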
Handling very large files in Python
When a netCDF file becomes large, it is unlikely that the entire file will fit into your laptop's memory. Instead, you can slice the dataset and load only the part you need. netCDF variables can be sliced with a syntax similar to NumPy arrays:
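For example, with the netCDF4 library (file and variable names are hypothetical):

```python
import netCDF4

nc = netCDF4.Dataset("era5_t2m.nc")   # hypothetical file
t2m = nc.variables["t2m"]             # assumed shape: (time, lat, lon)

# Only the requested hyperslab is read from disk into memory.
subset = t2m[:100, 200:400, 300:600]  # 100 time steps, one spatial window
print(subset.shape)                   # (100, 200, 300)
```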
The elapsed time to extract a slice depends strongly on how the data has been stored (how the dimensions are organized):
Test machine: Intel(R) Xeon(R) w5-3425, 3.19 GHz, 12 cores, 64 GB RAM
Climatology
Calculating Climatologies and Anomalies with Xarray and Dask:
Optimizing climatology calculation with Xarray and Dask
Strategies for climatology calculations
Best practices to go from 1000s of netcdf files to analyses on a HPC cluster?
Global Mean Surface Temperature
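The pattern behind these posts: group by a calendar field to get the climatology, subtract it for anomalies, and weight by cell area for a global mean. A sketch with Xarray and Dask (file, variable, and coordinate names are assumptions):

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("t2m_daily.nc", chunks={"time": 365})  # hypothetical file

# Monthly climatology: average all Januaries together, all Februaries, ...
clim = ds["t2m"].groupby("time.month").mean("time")

# Anomalies: subtract the matching month's climatology from each time step.
anom = ds["t2m"].groupby("time.month") - clim

# Global mean surface temperature: on a regular lat/lon grid,
# grid-cell area is proportional to cos(latitude).
weights = np.cos(np.deg2rad(ds["lat"]))
gmst = ds["t2m"].weighted(weights).mean(("lat", "lon"))
```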
Analysis ready kerchunk
Using AWS Lambda and PyWren for Landsat 8 Time Series
Accessing NetCDF and GRIB file collections as cloud-native virtual datasets using Kerchunk
Cloud-Performant NetCDF4/HDF5 with Zarr, Fsspec, and Intake
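The Kerchunk workflow these posts describe: scan each archival file once to build a JSON index of chunk byte ranges, then open the index through Xarray's Zarr engine. A sketch (bucket, paths, and options are assumptions):

```python
import json

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://some-bucket/era5/t2m_2020.nc"  # hypothetical object

# One-time scan: record where each chunk lives inside the NetCDF4/HDF5 file.
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()
with open("refs.json", "w") as out:
    json.dump(refs, out)

# Afterwards the file behaves like a Zarr store: lazy, chunk-by-chunk access.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "refs.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
```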
Julia
julia: What would be the best approach to handle large NetCDF sets?
Extracting data from netcdf stacked file
ClimateTools.jl Importing a NetCDF dataset
ClimateUtilities.jl reading NC
Python:
Earthkit - ECMWF Python library
Parallelizing Xarray with Dask from NCAR
Speeding up reading of very large netcdf file in python
I highly recommend that you take a look at the xarray and dask projects. Using these powerful tools allows you to easily split up the computation into chunks. This brings two advantages: you can compute on data that does not fit in memory, and you can use all of the cores of your machine for better performance. You can optimize the performance by appropriately choosing the chunk size (see the documentation).
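A sketch of that advice (file name, variable, and chunk size are assumptions):

```python
import xarray as xr

# chunks= turns each variable into a lazy Dask array instead of loading it.
ds = xr.open_dataset("very_large.nc", chunks={"time": 365})

# The reduction runs chunk by chunk, in parallel across cores.
result = ds["t2m"].mean("time").compute()
```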
How to select an inter-year period with xarray?
xarray - select the data at specific x AND y coordinates
Subtract two xarrays while keeping all dimensions
Resample xarray object to lower resolution spatially
Python: How to write large netcdf with xarray
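The idioms asked about in the questions above, in one hedged sketch (file, variable, and coordinate names are assumptions):

```python
import xarray as xr

ds = xr.open_dataset("data.nc", chunks={"time": 365})  # hypothetical file

# Select an inter-year period:
period = ds.sel(time=slice("1991-01-01", "2020-12-31"))

# Select data at specific (lat, lon) pairs (pointwise, not a cross product):
points = ds.sel(
    lat=xr.DataArray([10.0, 20.0], dims="points"),
    lon=xr.DataArray([30.0, 40.0], dims="points"),
    method="nearest",
)

# Subtract two DataArrays; xarray aligns the shared dimensions by label:
diff = ds["t2m"] - ds["skt"]

# Resample to lower spatial resolution by block-averaging 4x4 cells:
coarse = ds.coarsen(lat=4, lon=4, boundary="trim").mean()

# Write a large result lazily: build the task graph, then stream to disk.
delayed = coarse.to_netcdf("coarse.nc", compute=False)
delayed.compute()
```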
Visualization
https://projectpythia.org/advanced-viz-cookbook/notebooks/1-comparison.html