Develop: Main Developer Guide#

Thanks for helping develop pycmor! This document will guide you through the code structure and layout, and provide a few tips on how to contribute.

Getting Started#

To get started, you should clone the repository and install the dependencies. We give a few extra dependencies for testing and documentation, so you should install these as well:

git clone https://github.com/esm-tools/pycmor.git
cd pycmor
pip install -e ".[dev,doc]"

This will install the package in “editable” mode, so you can make changes to the code. The dev and doc extras will install the dependencies needed for testing and documentation, respectively. Before changing anything, make sure that the tests are passing:

pytest

Next, you should familiarize yourself with the code layout and the main building blocks. These are described in the next section.

Code Layout and Main Classes#

We use a src layout, with all files living under ./src/pycmor. The code is generally divided into several building blocks:

Rule is the main class that defines a rule for processing a CMOR variable.
Pipeline (or an object inherited from Pipeline) is a collection of actions that can be applied to a set of files described by a Rule. A few default pipelines are provided, and you can also define your own.
CMORizer is responsible for reading in the rules, and managing the various objects.

`Rule` Class#

This is the main building block for handling a set of files produced by a model. It has the following attributes:

input_patterns: A list of regular expressions that are used to match the input file name. Note that this is regex, not globbing!

cmor_variable: The CMOR name of the variable.

pipelines: A list of pipeline names that should be applied to the data.

Any other attributes can be added to the rule, and will appear in the rule_spec as attributes of the Rule object. In YAML, a minimal rule looks like this:

input_patterns: [".*"]
cmor_variable: tas
pipelines: [My Pipeline]

`Pipeline` Class#

The Pipeline class is a collection of actions that can be applied to a set of files. It should have a name attribute that describes the pipeline. If not given during construction, a random one is generated. The actions are stored in a list, and are applied in order. There are a few ways to construct a pipeline. You can either create one from a list of actions (also called steps):

>>> pipeline = pycmor.pipeline.Pipeline([action1, action2], name="My Pipeline")
>>> # Or use the class method:
>>> pl = Pipeline.from_list([action1, action2], name="My Pipeline")

where action1 and action2 are functions that follow the pipeline step protocol. See the guide on building actions for more information.

Another way to build actions is from a list of qualified names of functions. A class method is provided to do this easily:

>>> my_pipeline = Pipeline.from_qualnames(["my_module.my_action1", "my_module.my_action2"], name="My Pipeline")

`CMORizer` Class#

The CMORizer class is responsible for managing the rules and pipelines. It contains four configuration dictionaries:

pycmor.cfg: This is the configuration for the pycmor package. It should contain a version number, and any other configuration that is needed for the package to run. This is used to check that the configuration is correct for the specific version of pycmor. You can also specify certain features to be enabled or disabled here, as well as configure the logging.
global_cfg: This is the global configuration for the rules and pipelines. This is used for configuration that is common to all rules and pipelines, such as the path to the CMOR tables, or the path to the output directory. This is used to set up the environment for the rules and pipelines.
pipelines: This is a list of Pipeline objects that are used to process the data. These are the pipelines that are applied to the data, and are referenced by the rules. Each pipeline should have a unique name, and a series of steps to perform. You can also specify “frozen” arguments and key-word arguments to apply to steps in the pipeline’s configuration.
rules: This is a list of Rule objects that are used to match the data. Each rule should have a unique name, and a series of input patterns, a CMOR variable name, and a list of pipelines to apply to the data. You can also specify additional attributes that are used in the actions in the pipelines.

Building Actions for Pipelines#

When defining actions for a Pipeline, you should create functions with the following signature:

def my_action(data: Any,
              rule_spec: pycmor.rule.Rule,
              cmorizer: pycmor.cmorizer.CMORizer,
              *args, **kwargs) -> Any:
    ...
    return data

The data argument is the data that is passed from one action to the next. The rule_spec is the instance of the Rule class that is currently being evaluated. The cmorizer is the instance of the CMORizer class that is managing the pipeline. You can pass additional arguments to the action by using *args and **kwargs, however most arguments or keyword arguments should be extracted from the rule_spec. The action should return the data that will be passed to the next action in the pipeline. Note that the data can be any type, but it should be the same type as what is expected in the next action in the pipeline.

Note

If needed, you can construct “conversion” actions that will convert the data from one type to another and pass it to the next step.

When defining actions, you should also add a docstring that describes what the action does. This will be printed when the user asks for help on the action. Note that whenever possible, you should use the rule_spec to pass information into your action, rather than hardcoding it or passing in arguments. You can also use additional arguments if needed, and these can be fixed to always use the same values for the entire pipeline the action belongs to, or, alternatively, to the rule that the action is a part of. A few illustrative examples may make this clearer.

Example 1: A simple action that adds 1 to the data:

def add_one(data: Any, rule_spec: pycmor.rule.Rule, cmorizer: pycmor.cmorizer.CMORizer) -> Any:
    """Add one to the data."""
    return data + 1

Using this in a pipeline would look like this in Python code:

pipeline = pycmor.pipeline.Pipeline([add_one], name="Add One")
rule_spec = pycmor.rule.Rule(input_patterns=[".*"], cmor_variable="tas", pipelines=["Add One"])
cmorizer = pycmor.cmorizer.CMORizer(pycmor.cfg={"version": "unreleased"}, global_cfg={}, rules=[rule_spec], pipelines=[pipeline])
initial_data = 1
data = pipeline.run(initial_data, rule_spec, cmorizer)

In yaml, the same pipeline and configuration looks like this:

pycmor:
  version: unreleased

general:

pipelines:
  - name: Add One
    actions:
      - add_one
rules:
  - input_patterns: [".*"]
    cmor_variable: tas
    pipelines: [Add One]

Example 2: An action that sets an attribute on a xarray.Dataset, where this is specified in the rule specification:

def set_attribute(data: xr.Dataset, rule_spec: pycmor.rule.Rule, cmorizer: pycmor.cmorizer.CMORizer) -> xr.Dataset:
    """Set an attribute on the dataset."""
    data.attrs[rule_spec.attribute_name] = rule_spec.attribute_value
    return data

Using this in a pipeline would look like this in yaml:

pycmor:
  version: unreleased

general:

pipelines:
  - name: Set Attribute
    actions:
      - set_attribute
rules:
  - input_patterns: [".*"]
    cmor_variable: tas
    pipelines: [Set Attribute]
    attribute_name: "my_attribute"
    attribute_value: "my_value"

Example 3: An action that sets an attribute on a Dataset, where this is specified in the Pipeline.

It is the responsibility of the action developer to ensure arguments are passed correctly and have sensible values. This is a more complicated example. Here we check if the rule has a specific attribute that matches the action’s name, with “_args” appended. We use those values if that is the case. Otherwise, they can be obtained from the pipeline, and default to empty strings. As an action developer, you need to ensure sensible logic here!

def set_attribute(data: xr.Dataset, rule_spec: pycmor.rule.Rule, cmorizer: pycmor.cmorizer.CMORizer, attribute_name: str = "", attribute_value: str = "", *args, **kwargs) -> xr.Dataset:
    """Set an attribute on the dataset."""
    if hasattr(rule_spec, f"{__name__}_args"):
        attribute_name = getattr(rule_spec, f"{__name__}_args").get("attribute_name", my_attribute)
        attribute_value = getattr(rule_spec, f"{__name__}_args").get("attribute_value", my_value)
    data.attrs[attribute_name] = attribute_value
    return data

Using this in a pipeline would look like this in yaml:

pycmor:
  version: unreleased

general:

pipelines:
  - name: Set Attribute
    actions:
      - set_attribute
    attribute_name: "my_attribute"
    attribute_value: "my_value"
rules:
  - input_patterns: [".*"]
    cmor_variable: tas
    pipelines: [Set Attribute]

Important

In the case of passing arguments that are not in the rule spec, you need to be careful about where you place the information. The Rule should win, if there are conflicts between the rule and the pipeline. This is because the rule is the most specific, and the pipeline is the most general. So, to have a value specified in the rule, you should do:

pycmor:
  version: unreleased

general:

pipelines:
  - name: Set Attribute
    actions:
      - set_attribute
    attribute_name: "my_attribute"
    attribute_value: "my_value"
rules:
  - input_patterns: [".*"]
    cmor_variable: tas
    pipelines: [Set Attribute]
    set_attribute_args:
      attribute_name: "my_other_attribute"
      attribute_value: "my_other_value"

Attention

If you want more examples in the handbook, please open an issue or a pull request!

Develop: Main Developer Guide

Contents

Develop: Main Developer Guide#

Getting Started#

Code Layout and Main Classes#

Rule Class#

Pipeline Class#

CMORizer Class#

Building Actions for Pipelines#

`Rule` Class#

`Pipeline` Class#

`CMORizer` Class#