Skip to content
Damien Irving edited this page Mar 4, 2022 · 29 revisions

The package assists with going from a forecast dataset to a collection of valid, independent samples for assessment.

Relevant code and packages

Relevant papers

Parallel processing on NCI

A local Dask cluster can be setup as follows:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)

You can set up a Dask cluster under the OpenOnDemand environment across multiple nodes using the cloud-based SLURM cluster.

from dask.distributed import Client,Scheduler
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(cores=16, memory="31GB")
client = Client(cluster)
cluster.scale(cores=32)

You can also use climtas.nci.GadiClient() following these instructions.

Observational data

/g/data/xv83/observations/
/g/data/zv2/agcd/v2

CAFE data

Code for model run: https://github.com/csiro-dcfp/cm-forecasts

The raw netcdf files produced by the model are sent to tape when the model is run. To access the 2020-11 forecasts (for instance) you need to be a member of the v14 project:

$ mdss -P v14 dmls -l f6/c5-d60-pX-f6-20201101

Code for post-processing (Zarr conversion): https://github.com/csiro-dcfp/post-processing

The post-processed data is stored at /g/data/xv83/dcfp/CAFE-f6.

The control run that the simulation branched from (c5), data assimilation method (d60), perturbation method (pX) are indicated. They are the defining features of the f5 dataset.

The f5 dataset has 10 ensemble members and was run over a longer period of time, whereas f6 has 96 members over a shorter period.

Issues

(See a more complete summary at: https://cafef6.readthedocs.io/en/latest/notebooks/CAFE-f6_issues.html)

The CAFE monthly precipitation has listed units of kg/m2/s-1 but is actually the sum of the daily kg/m2/s-1 values.

In CAFE-60 (the reanalysis that is used to initialise the forecasts), a bunch of water was added to the ocean around 1990 (which has a trivial impact on most variables) and then the bias correction scheme changed for forecasts initialised from 1992 onwards (i.e. after satellite altimetry data came online). The scheme allowed an SST cold bias up until 1992 because there were so few observations to assimilate (and so the simulation would crash if you didn't allow a cold bias), but then didn't allow a bias to persist (or at least as much of a bias) after 1992. This causes a sudden warming-related step change in most surface and atmospheric variables. For depth integrated ocean quantities warming continues for many years as the SST step change propagates into the deep ocean but then is arrested (and begins to cool for a few years) as Argo observations come online in the mid-2000s (i.e. the sub-surface ocean was basically unconstrained until then).

The bias correction step change doesn't seem to impact global precipitation very much, but the model is very sensitive to volcanoes and thus pronounced rainfall reductions are associated with Pinatubo (1991) and El Chichón (1982).

A different version of the model executable file was used in the f6 dataset for the May forecasts from 2005 onwards than for the rest of the forecasts. This creates a bit of a zig-zag behaviour in many variables.

Processing steps

Step 1: Pre-processing

TODO:

  • I/O with netcdf
  • Temporal aggregation (e.g. annual, seasonal or monthly) is the tricky part of the preprocessing, because you need to decide how to handle/modify the lead_time coordinate
    • Resample frequencies
      • Y (annual, with date label being last day of year)
      • M (monthly, with date label being last day of month)
      • Q-NOV (DJF, MAM, JJA, SON, with date label being last day of season)
      • A-NOV (annual mean for Dec-Nov, date label being last day of the year)
      • YS (annual, with date label being first day of the year)
      • MS (monthly, with date label being first day of month)
      • QS-DEC (DJF, MAM, JJA, SON, with date label being first day of season)
      • AS-DEC (annual mean for Dec-Nov, with date label being first day of year)
  • xarray pad for NaN filling
  • ClimPred has functions for (and a good explanation of) verification alignment

Step 2: Fidelity testing

Returns a mask (to remove gird points that fail the test) or a test statistic/s for each grid point corresponding to different alpha levels (e.g. 0.99, 0.95) that a mask could be created from.

Distribution tests

Dougie's paper uses the Kolmogorov–Smirnov test statistic (the 1D function comes from SciPy, the 2D he wrote himself) to test the similarity between the observational and forecast distributions. Other papers apply the Multiple-moment test (1D only), or we could use scipy implementation of the Anderson-Darling test. It may be possible to use xclim for the KS test calculation once this issue is resolved, otherwise xks could be incorporated into this unseen package. For a 1D KS test there's a formula to determine the confidence interval. For the 2D test you need to use bootstrapping to find the confidence interval (i.e. you take many sub-samples of the forecast data - each sub-sample has the same size as the observations - and calculate a KS value for each sub-sample).

Need to calculate test scores for each lead time separately (because model can drift away from initial state).

Correlation tests

Because all the ensemble members start from the same point, they are also correlated (i.e. not independent) for n lead times. For instance, lead times < 30 months might have a correlation higher than some threshold, so you'd would exclude all lead times < 30 months from your analysis.

Issues

An issue is data selection where you want the same number of samples for each lead time (e.g. for a dataset where 10-year forecasts are initiated every year from 2005-2020, you don't get full overlap until 2014). Dougie has written a function stack_super_ensemble to subset the overlap data, but it's inefficient.

Step 3: The easier stuff

i.e. counting to determine the likelihood of a particular event.

Development notes

Design choices

Define functions that can be used to construct workflows in a Jupyter notebook or linked together in the main() function of Python scripts.

Patch extraction

A common operation involves taking observational datasets and converting the time axis to a dual initial date / lead time coordinate. This basically involves duplicating the data via overlapping windows, which is analogous to a process in image processing (to make images more blurry) called patch extraction. There are implementations available as part of skimage (skimage.util.view_as_windows), numpy (numpy.lib.stride_tricks.sliding_window_view), dask (dask.array.overlap.sliding_window_view) and xarray (xarray.core.rolling).

Data selection using multi-indexes

For many analyses we ultimately want to reduce all the non-spatial dimensions down to one single dimension. For example, this array,

ds_cafe

Dimensions:    (ensemble: 96, init_date: 4, lat: 17, lead_time: 3650, lon: 17)
Coordinates:
  * ensemble   (ensemble) int64 1 2 3 4 5 6 7 8 9 ... 88 89 90 91 92 93 94 95 96
  * init_date  (init_date) datetime64[ns] 1990-11-01 1991-11-01 ... 1993-11-01
  * lat        (lat) float64 -43.48 -41.46 -39.44 ... -15.17 -13.15 -11.12
  * lead_time  (lead_time) int64 0 1 2 3 4 5 6 ... 3644 3645 3646 3647 3648 3649
  * lon        (lon) float64 113.8 116.2 118.8 121.2 ... 146.2 148.8 151.2 153.8
    time       (lead_time, init_date) datetime64[ns] dask.array<chunksize=(3650, 4), meta=np.ndarray>
Data variables:
    precip     (init_date, lead_time, ensemble, lat, lon) float64 dask.array<chunksize=(1, 50, 96, 17, 17), meta=np.ndarray>

would simply become (lat, lon, sample), where sample is a MultiIndex with related ensemble, init_date and lead_time coordinate values.

ds_stacked = ds_cafe.stack({'sample': ['ensemble', 'init_date', 'lead_time']})
ds_stacked

Dimensions:    (lat: 17, lon: 17, sample: 1401600)
Coordinates:
  * lat        (lat) float64 -43.48 -41.46 -39.44 ... -15.17 -13.15 -11.12
  * lon        (lon) float64 113.8 116.2 118.8 121.2 ... 146.2 148.8 151.2 153.8
    time       (sample) datetime64[ns] dask.array<chunksize=(1401600,), meta=np.ndarray>
  * sample     (sample) MultiIndex
  - ensemble   (sample) int64 1 1 1 1 1 1 1 1 1 1 ... 96 96 96 96 96 96 96 96 96
  - init_date  (sample) datetime64[ns] 1990-11-01 1990-11-01 ... 1993-11-01
  - lead_time  (sample) int64 0 1 2 3 4 5 6 ... 3644 3645 3646 3647 3648 3649
Data variables:
    precip     (lat, lon, sample) float64 dask.array<chunksize=(17, 17, 14600), meta=np.ndarray>

We can index (i.e. select and slice) the data using references to ensemble, init_date, lead_time, but not time, which is a bit of a limitation because the work around for time selection involves creating a boolean array the same size as the data array.

I think this limitation will be overcome with the implementation of flexible indexes. A proposal has been developed for adding this feature and work has started as part of the xarray CZI grant.

Making functions dask-aware

Use apply_ufunc. Here's some examples: