Store Model Inputs and Outputs¶
Simulation inputs and/or outputs can be kept in memory or saved on disk using either xarray’s or zarr’s I/O capabilities.
Using xarray¶
The xarray.Dataset
structure, used for both simulation inputs and
outputs, already supports serialization and I/O to several file formats, among
which netCDF is the recommended format. For more information, see Section
reading and writing files in xarray’s docs.
Before showing some examples, let’s first create the same initial setup as the one used in Section Setup and Run Models:
In [1]: advect_model
Out[1]:
<xsimlab.Model (4 processes, 5 inputs)>
grid
spacing [in] uniform spacing
length [in] total length
init
loc [in] location of initial pulse
scale [in] scale of initial pulse
advect
v [in] () or ('x',) velocity
profile
In [2]: import xsimlab as xs
In [3]: ds = xs.create_setup(
...: model=advect_model,
...: clocks={
...: 'time': np.linspace(0., 1., 101),
...: 'otime': [0, 0.5, 1]
...: },
...: main_clock='time',
...: input_vars={
...: 'grid': {'length': 1.5, 'spacing': 0.01},
...: 'init': {'loc': 0.3, 'scale': 0.1},
...: 'advect__v': 1.
...: },
...: output_vars={
...: 'grid__x': None,
...: 'profile__u': 'otime'
...: }
...: )
...:
In [4]: ds
Out[4]:
<xarray.Dataset>
Dimensions: (otime: 3, time: 101)
Coordinates:
* time (time) float64 0.0 0.01 0.02 0.03 0.04 ... 0.97 0.98 0.99 1.0
* otime (otime) float64 0.0 0.5 1.0
Data variables:
grid__length float64 1.5
grid__spacing float64 0.01
init__loc float64 0.3
init__scale float64 0.1
advect__v float64 1.0
Attributes:
__xsimlab_output_vars__: grid__x
You can save the dataset above, e.g., using xarray.Dataset.to_netcdf():
In [5]: ds.to_netcdf("advect_model_setup.nc")
You can then reload this setup later or elsewhere before starting a new simulation:
In [6]: import xarray as xr
In [7]: in_ds = xr.load_dataset("advect_model_setup.nc")
In [8]: out_ds = in_ds.xsimlab.run(model=advect_model)
In [9]: out_ds
Out[9]:
<xarray.Dataset>
Dimensions: (otime: 3, time: 101, x: 150)
Coordinates:
* otime (otime) float64 0.0 0.5 1.0
* time (time) float64 0.0 0.01 0.02 0.03 0.04 ... 0.97 0.98 0.99 1.0
* x (x) float64 0.0 0.01 0.02 0.03 0.04 ... 1.46 1.47 1.48 1.49
Data variables:
advect__v float64 1.0
grid__length float64 1.5
grid__spacing float64 0.01
grid__x (x) float64 0.0 0.01 0.02 0.03 0.04 ... 1.46 1.47 1.48 1.49
init__loc float64 0.3
init__scale float64 0.1
profile__u (otime, x) float64 0.0001234 0.0002226 ... 0.03916 0.02705
The latter dataset out_ds contains both the inputs and the outputs of this
model run. Likewise, you can write it to a netCDF file or any other format
supported by xarray, e.g.,
In [10]: out_ds.to_netcdf("advect_model_run.nc")
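Once a run has been saved, plain xarray is enough to get back at the results. The sketch below uses a small stand-in dataset mirroring the variable names of the run above; in practice you would simply reopen the file with xarray.open_dataset("advect_model_run.nc"):

```python
import numpy as np
import xarray as xr

# Stand-in for the saved run; in practice:
# saved = xr.open_dataset("advect_model_run.nc")
saved = xr.Dataset(
    {
        "profile__u": (("otime", "x"), np.random.rand(3, 150)),
        "grid__x": ("x", np.arange(150) * 0.01),
    },
    coords={"otime": [0.0, 0.5, 1.0]},
)

# Pull out the profile at the last saved snapshot
final_profile = saved["profile__u"].sel(otime=1.0)
```

Since inputs and outputs live in the same dataset, the usual xarray selection and computation methods apply directly to both.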
Using zarr¶
When xarray.Dataset.xsimlab.run()
is called, xarray-simlab uses the zarr
library to efficiently store (i.e., with compression) both simulation input and
output data. Input data is stored before the simulation starts and output data
is stored progressively as the simulation proceeds.
By default, all of this data is kept in memory. For large amounts of model I/O data, however, it is recommended to save the data on disk. For example, you can specify a directory in which to store it:
In [11]: out_ds = in_ds.xsimlab.run(model=advect_model, store="advect_model_run.zarr")
You can also store the data in a temporary directory:
In [12]: import zarr
In [13]: out_ds = in_ds.xsimlab.run(model=advect_model, store=zarr.TempStore())
Or you can directly use zarr.group() for more options, e.g., to overwrite a
directory that was used for previous model runs:
In [14]: zg = zarr.group("advect_model_run.zarr", overwrite=True)
In [15]: out_ds = in_ds.xsimlab.run(model=advect_model, store=zg)
Note
The zarr library provides many storage alternatives, including support for
distributed/cloud and database storage systems (see storage alternatives
in zarr’s tutorial). Note, however, that a few alternatives won’t work well
with xarray-simlab. For example, zarr.ZipStore
doesn’t support
feeding a zarr dataset once it has been created.
Regardless of the chosen alternative, xarray.Dataset.xsimlab.run() returns
an xarray.Dataset object that contains the data (lazily) loaded from the zarr
store:
In [16]: out_ds
Out[16]:
<xarray.Dataset>
Dimensions: (otime: 3, time: 101, x: 150)
Coordinates:
* otime (otime) float64 0.0 0.5 1.0
* time (time) float64 0.0 0.01 0.02 0.03 0.04 ... 0.97 0.98 0.99 1.0
* x (x) float64 0.0 0.01 0.02 0.03 0.04 ... 1.46 1.47 1.48 1.49
Data variables:
advect__v float64 1.0
grid__length float64 1.5
grid__spacing float64 0.01
grid__x (x) float64 dask.array<chunksize=(150,), meta=np.ndarray>
init__loc float64 0.3
init__scale float64 0.1
profile__u (otime, x) float64 dask.array<chunksize=(3, 150), meta=np.ndarray>
Zarr stores large multi-dimensional arrays as contiguous chunks. When opened as
an xarray.Dataset, xarray keeps track of those chunks, which enables efficient
and parallel post-processing via the dask library (see Section parallel
computing with Dask in xarray’s docs).
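Because the output variables are dask-backed, any further reduction or arithmetic stays lazy until you ask for the result. A sketch with a stand-in chunked dataset (assuming dask is installed):

```python
import numpy as np
import xarray as xr

# Stand-in for a chunked simulation output dataset
out = xr.Dataset(
    {"profile__u": (("otime", "x"), np.random.rand(3, 150))}
).chunk({"x": 50})

# The mean is built lazily as a dask graph ...
mean_profile = out["profile__u"].mean(dim="otime")

# ... and only computed (in parallel, chunk by chunk) on request
result = mean_profile.compute()
```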
Advanced usage¶
Encoding options¶
Encoding options let you control how Zarr stores simulation data. Those
options can be set for variables declared in process classes; see the encoding
parameter of variable() for all available options. Encoding options may also
be set or overridden when calling run(). An often-encountered use case for
encoding is suppressing nan values in the output dataset.
Default fill values¶
By default, output variables are given a fill value determined at model runtime from the variable’s data type; in the output dataset, xarray decodes those fill values to np.nan where possible. This only affects the output arrays: during the model run, the actual values are used.
| datatype | fill value | example |
|---|---|---|
| (unsigned) integer | maximum possible value | uint8, 255 |
| float | np.nan | |
| string | empty string | "" |
| bool | False | |
| complex values | default for each dtype | |
Especially for boolean data types, where the default fill value is False, it
can be desirable to suppress this behaviour. There are several options, each
with different benefits and drawbacks, as outlined below.
If you know beforehand what the fill_value
should be, this can be set directly in the process class:
@xs.process
class Foo:
v_bool_nan = xs.variable(dims="x", intent="out")
# suppress nan values by setting an explicit fill value:
v_bool = xs.variable(dims="x", intent="out", encoding={"fill_value": None})
def initialize(self):
self.v_bool_nan = [True, False]
self.v_bool = [True, False]
When no fill_value is specified, values equal to the default fill value are
decoded to nan in the output:
In [17]: model = xs.Model({"foo": Foo})
In [18]: in_ds = xs.create_setup(
....: model=model,
....: clocks={"clock": [0, 1]},
....: output_vars={"foo__v_bool": None, "foo__v_bool_nan": None},
....: )
....:
# this will result in nan values in the output
In [19]: in_ds.xsimlab.run(model=model)
Out[19]:
<xarray.Dataset>
Dimensions: (clock: 2, x: 2)
Coordinates:
* clock (clock) int64 0 1
Dimensions without coordinates: x
Data variables:
foo__v_bool (x) bool True False
foo__v_bool_nan (x) object True nan
Alternatively, encoding (or decoding) options can be set during model run:
# set encoding options during model run
In [20]: in_ds.xsimlab.run(model=model, encoding={"foo__v_bool_nan": {"fill_value": None}})
Out[20]:
<xarray.Dataset>
Dimensions: (clock: 2, x: 2)
Coordinates:
* clock (clock) int64 0 1
Dimensions without coordinates: x
Data variables:
foo__v_bool (x) bool True False
foo__v_bool_nan (x) bool True False

# set mask_and_scale to false
In [21]: in_ds.xsimlab.run(model=model, decoding={"mask_and_scale": False})
Out[21]:
<xarray.Dataset>
Dimensions: (clock: 2, x: 2)
Coordinates:
* clock (clock) int64 0 1
Dimensions without coordinates: x
Data variables:
foo__v_bool (x) bool True False
foo__v_bool_nan (x) bool True False
However, using mask_and_scale: False results in non-serializable attributes
in the output dataset, so the other alternatives are preferable.
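If you do end up with a dataset whose attributes cannot be serialized, one generic workaround (a sketch using plain xarray, not specific to xarray-simlab) is to drop the offending per-variable attributes before writing:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"v": ("x", np.arange(3))})
# Suppose a previous step attached a non-serializable attribute
ds["v"].attrs["fill_value"] = None

# Drop per-variable attributes that netCDF cannot store, then write
for name in ds.variables:
    ds[name].attrs = {}
# ds.to_netcdf("cleaned.nc")  # would now succeed
```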
Dynamically sized arrays¶
Model variables may have one or several of their dimensions dynamically resized during a simulation. When those variables are saved as outputs, the corresponding zarr datasets are resized so that, at the end of the simulation, all values are stored in arrays of fixed shape, possibly containing missing values (note: depending on chunk size, zarr doesn’t need to physically store regions of contiguous missing values).
The example below illustrates how such variables are returned as outputs:
In [22]: import numpy as np
In [23]: @xs.process
....: class Particles:
....: """Generate at each step a random number of particles
....: at random positions along an axis.
....: """
....:
....: position = xs.variable(dims='pt', intent='out')
....:
....: def initialize(self):
....: self._rng = np.random.default_rng(123)
....:
....: def run_step(self):
....: nparticles = self._rng.integers(1, 4)
....: self.position = self._rng.uniform(0, 10, size=nparticles)
....:
In [24]: model = xs.Model({'pt': Particles})
In [25]: with model:
....: in_ds = xs.create_setup(clocks={'steps': range(4)},
....: output_vars={'pt__position': 'steps'})
....: out_ds = in_ds.xsimlab.run()
....:
In [26]: out_ds.pt__position
Out[26]:
<xarray.DataArray 'pt__position' (steps: 4, pt: 3)>
array([[0.53821019, nan, nan],
[2.20359873, 1.84371811, 1.75905901],
[9.23344998, 2.76574398, nan],
[9.23344998, 2.76574398, nan]])
Coordinates:
* steps (steps) int64 0 1 2 3
Dimensions without coordinates: pt
N-dimensional arrays with missing values might not be the best format for
dealing with this kind of output data. They can be converted into a denser
format, such as a pandas.DataFrame with a multi-index, using the xarray
Dataset or DataArray stack(), dropna() and to_dataframe() methods:
In [27]: (out_ds.stack(particles=('steps', 'pt'))
....: .dropna('particles')
....: .to_dataframe())
....:
Out[27]:
pt__position
steps pt
0 0 0.538210
1 0 2.203599
1 1.843718
2 1.759059
2 0 9.233450
1 2.765744
3 0 9.233450
1 2.765744
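From such a tidy frame, the usual pandas machinery applies; for example, counting the particles generated at each step (a sketch with the values from the run above hard-coded):

```python
import pandas as pd

# Multi-indexed frame equivalent to the to_dataframe() output above
df = pd.DataFrame(
    {"pt__position": [0.538210, 2.203599, 1.843718, 1.759059,
                      9.233450, 2.765744, 9.233450, 2.765744]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 0), (3, 1)],
        names=["steps", "pt"],
    ),
)

# Number of particles generated at each step
counts = df.groupby(level="steps").size()
```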