A journey through large climate files
ar - An ML-ready, unified (surface & atmospheric) version of the data in Zarr. (Analysis Ready)
co - A port of Gaussian-gridded ERA5 data to Zarr. (Cloud Optimized)
raw - All raw GRIB & NetCDF data.
raven - Development of an ERA5 cloud storage for efficient access #396
https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11
https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935
https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685
https://medium.com/pangeo/using-kerchunk-with-uncompressed-netcdf-64-bit-offset-files-cloud-optimized-access-to-hycom-ocean-9008ba6d0d67
https://www.coiled.io/
XARRAY
DASK
Parallel Programming in Climate and Weather
climtas: Climate Timeseries Analysis
Climtas is a package for working with large climate analyses. It focuses on the time domain, with custom functions for Xarray and Dask data.
grid metrics (vorticity, divergence, etc.)
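For the grid metrics above, MetPy can compute vorticity and divergence directly from Xarray wind components. A minimal sketch, assuming a hypothetical winds.nc whose u/v variables carry CF coordinates and units attributes:

```python
import xarray as xr
import metpy.calc as mpcalc

# Hypothetical file with u/v wind components and CF metadata.
ds = xr.open_dataset("winds.nc").metpy.parse_cf()

# MetPy infers grid spacing from the latitude/longitude coordinates,
# provided the variables have units attributes.
vort = mpcalc.vorticity(ds["u"], ds["v"])
div = mpcalc.divergence(ds["u"], ds["v"])
```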
Parallelizing Xarray with Dask
Unidata Chunking Data: Why it Matters
With a conventional contiguous (index-order) storage layout, the time dimension varies most slowly, y varies faster, and x varies fastest. In this case, the spatial access is fast (0.013 sec) and the time series access is slow (180 sec, which is 14,000 times slower). If we instead want the time series to be quick, we can reorganize the data so x or y is the most slowly varying dimension and time varies fastest, resulting in fast time-series access (0.012 sec) and slow spatial access (200 sec, 17,000 times slower). In either case, the slow access is so slow that it makes the data essentially inaccessible for all practical purposes, e.g. in analysis or visualization.
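The compromise the Unidata post builds toward is chunked storage: pick a chunk shape that makes both access patterns tolerably fast, instead of one instant and one unusable. A minimal sketch with the netCDF4 Python library (file name, dimension sizes, and chunk shape are illustrative assumptions, not recommendations):

```python
import numpy as np
import netCDF4

# Hypothetical file with a (time, y, x) temperature variable.
nc = netCDF4.Dataset("example.nc", "w")
nc.createDimension("time", None)
nc.createDimension("y", 720)
nc.createDimension("x", 1440)

# Chunked rather than contiguous storage: each chunk covers many time
# steps of a modest spatial tile, so neither time-series reads nor
# map reads have to sweep the whole file.
t2m = nc.createVariable(
    "t2m", "f4", ("time", "y", "x"),
    chunksizes=(256, 36, 72),  # assumption: tune for your access patterns
)
t2m[0, :, :] = np.zeros((720, 1440), dtype="f4")
nc.close()
```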
Handling very large files in Python
When a netCDF file becomes large, it is unlikely that the entire file will fit into your laptop's memory. Instead, you can slice the dataset and load only the part you need. netCDF variables can be sliced with a syntax similar to NumPy arrays:
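For example, with the netCDF4 library (file and variable names are hypothetical):

```python
import netCDF4

nc = netCDF4.Dataset("era5_t2m.nc")   # hypothetical file
t2m = nc.variables["t2m"]             # assumed shape: (time, lat, lon)

# Only the requested hyperslab is read from disk into memory.
subset = t2m[:100, 200:400, 300:600]  # 100 time steps, one spatial window
print(subset.shape)                   # (100, 200, 300)
```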
The elapsed time to extract a slice depends strongly on how the data has been stored (how the dimensions are organized):
Test machine: Intel(R) Xeon(R) w5-3425, 3.19 GHz, 12 cores, 64 GB RAM
Climatology
Calculating Climatologies and Anomalies with Xarray and Dask:
Optimizing climatology calculation with Xarray and Dask
Strategies for climatology calculations
Best practices to go from 1000s of netcdf files to analyses on a HPC cluster?
Global Mean Surface Temperature
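The pattern behind these posts: group by a calendar field to get the climatology, subtract it for anomalies, and weight by cell area for a global mean. A sketch with Xarray and Dask (file, variable, and coordinate names are assumptions):

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("t2m_daily.nc", chunks={"time": 365})  # hypothetical file

# Monthly climatology: average all Januaries together, all Februaries, ...
clim = ds["t2m"].groupby("time.month").mean("time")

# Anomalies: subtract the matching month's climatology from each time step.
anom = ds["t2m"].groupby("time.month") - clim

# Global mean surface temperature: on a regular lat/lon grid,
# grid-cell area is proportional to cos(latitude).
weights = np.cos(np.deg2rad(ds["lat"]))
gmst = ds["t2m"].weighted(weights).mean(("lat", "lon"))
```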
Analysis ready kerchunk
Using AWS Lambda and PyWren for Landsat 8 Time Series
Accessing NetCDF and GRIB file collections as cloud-native virtual datasets using Kerchunk
Cloud-Performant NetCDF4/HDF5 with Zarr, Fsspec, and Intake
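The Kerchunk workflow these posts describe: scan each archival file once to build a JSON index of chunk byte ranges, then open the index through Xarray's Zarr engine. A sketch (bucket, paths, and options are assumptions):

```python
import json

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://some-bucket/era5/t2m_2020.nc"  # hypothetical object

# One-time scan: record where each chunk lives inside the NetCDF4/HDF5 file.
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()
with open("refs.json", "w") as out:
    json.dump(refs, out)

# Afterwards the file behaves like a Zarr store: lazy, chunk-by-chunk access.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "refs.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
```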
Julia
julia: What would be the best approach to handle large NetCDF sets?
Extracting data from netcdf stacked file
ClimateTools.jl Importing a NetCDF dataset
ClimateUtilities.jl reading NC
Python:
Earthkit - ECMWF Python library
Parallelizing Xarray with Dask from NCAR
Speeding up reading of very large netcdf file in python
I highly recommend that you take a look at the xarray and dask projects. Using these powerful tools allows you to easily split up the computation into chunks. This brings two advantages: you can compute on data that does not fit in memory, and you can use all of the cores of your machine for better performance. You can optimize the performance by appropriately choosing the chunk size (see the documentation).
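A sketch of that advice (file name, variable, and chunk size are assumptions):

```python
import xarray as xr

# chunks= turns each variable into a lazy Dask array instead of loading it.
ds = xr.open_dataset("very_large.nc", chunks={"time": 365})

# The reduction runs chunk by chunk, in parallel across cores.
result = ds["t2m"].mean("time").compute()
```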
How to select an inter-year period with xarray?
xarray - select the data at specific x AND y coordinates
Subtract two xarrays while keeping all dimensions
Resample xarray object to lower resolution spatially
Python: How to write large netcdf with xarray
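The idioms asked about in the questions above, in one hedged sketch (file, variable, and coordinate names are assumptions):

```python
import xarray as xr

ds = xr.open_dataset("data.nc", chunks={"time": 365})  # hypothetical file

# Select an inter-year period:
period = ds.sel(time=slice("1991-01-01", "2020-12-31"))

# Select data at specific (lat, lon) pairs (pointwise, not a cross product):
points = ds.sel(
    lat=xr.DataArray([10.0, 20.0], dims="points"),
    lon=xr.DataArray([30.0, 40.0], dims="points"),
    method="nearest",
)

# Subtract two DataArrays; xarray aligns the shared dimensions by label:
diff = ds["t2m"] - ds["skt"]

# Resample to lower spatial resolution by block-averaging 4x4 cells:
coarse = ds.coarsen(lat=4, lon=4, boundary="trim").mean()

# Write a large result lazily: build the task graph, then stream to disk.
delayed = coarse.to_netcdf("coarse.nc", compute=False)
delayed.compute()
```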
Visualization
https://projectpythia.org/advanced-viz-cookbook/notebooks/1-comparison.html