pycmor.std_lib package

The Pycmor Standard Library#

The standard library contains functions that are included in the default pipelines, and are generally used as step functions. We expose several useful ones:

  • Unit Conversion

  • Time Averaging

  • Dataset Loading

  • Variable Extraction

  • Temporal Resampling

  • Trigger Compute

  • Show Data

  • Global Attributes

  • Variable Attributes

See the documentation for each of the steps for more details.
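Every step listed on this page shares the signature `step(data, rule)` and returns the (possibly modified) data, so a pipeline can be pictured as a fold over a list of steps. The sketch below only illustrates that calling convention: `ToyRule`, `scale`, `add_note`, and `run_steps` are hypothetical names, and a plain dictionary stands in for an xarray object.

```python
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class ToyRule:
    """Hypothetical stand-in for a pycmor Rule (not the real class)."""

    factor: float = 1.0
    notes: list = field(default_factory=list)


def scale(data, rule):
    # A step receives the data and the rule, and returns new data.
    return {name: value * rule.factor for name, value in data.items()}


def add_note(data, rule):
    # A step may record side information and pass the data through unchanged.
    rule.notes.append(f"saw {len(data)} variable(s)")
    return data


def run_steps(steps, data, rule):
    # Feed the output of each step into the next one.
    return reduce(lambda current, step: step(current, rule), steps, data)


rule = ToyRule(factor=2.0)
result = run_steps([scale, add_note], {"tas": 1.5}, rule)
print(result)
print(rule.notes)
```

Because each step returns data, pass-through steps such as show_data or checkpoint_pipeline can be placed anywhere in the list without changing the result.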

pycmor.std_lib.checkpoint_pipeline(data: DataArray | Dataset, rule: Rule) DataArray | Dataset[source]#

Insert a checkpoint in the pipeline processing.

This function allows for state saving during pipeline processing, which can be useful for debugging or resuming processing from a specific point.

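Parameters:
  • data (xarray.DataArray or xarray.Dataset) – The data passing through the pipeline.

  • rule (Rule) – The rule whose configuration controls what the checkpoint does.

Returns: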

The input data (typically unchanged).

Return type:

xarray.DataArray or xarray.Dataset

Notes

Depending on the configuration in rule, this function might:

  • Save the current state to disk

  • Log the current state

  • Perform debugging operations

pycmor.std_lib.convert_units(data: DataArray | Dataset, rule: Rule) DataArray | Dataset[source]#

Convert units of a DataArray or Dataset based upon the Data Request Variable you have selected. Automatically handles chemical elements and dimensionless units.

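Parameters:
  • data (xarray.DataArray or xarray.Dataset) – The data to convert.

  • rule (Rule) – The rule identifying the Data Request Variable that defines the target units.

Returns: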

The converted data.

Return type:

xarray.DataArray or xarray.Dataset

pycmor.std_lib.get_variable(data: DataArray | Dataset, rule: Rule) DataArray | Dataset[source]#

Extract a variable from a dataset as a DataArray.

Parameters:
  • data (xarray.Dataset) – The dataset containing the variable to extract.

  • rule (Rule) – The rule containing the variable name to extract.

Returns:

The extracted variable as a DataArray.

Return type:

xarray.DataArray

Raises:

KeyError – If the variable specified in the rule does not exist in the dataset.

pycmor.std_lib.load_data(data: DataArray | Dataset | None, rule: Rule) DataArray | Dataset[source]#

Load data from files according to the rule specification.

This function opens and combines data from multiple files that match the pattern specified in the rule. It’s useful for loading time series data that may be spread across multiple files.

Parameters:
  • data (xarray.DataArray or xarray.Dataset or None) – Existing data (if any) to incorporate with loaded data.

  • rule (Rule) – The rule containing the input patterns and other specifications for loading the data.

Returns:

The loaded data combined into a single Dataset or DataArray.

Return type:

xarray.DataArray or xarray.Dataset

Notes

The rule should contain an input_patterns entry with a list of file patterns to match, e.g., ["path/to/data/*.nc"].
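As a rough illustration of what a list of file patterns means in practice, the sketch below expands glob patterns into a sorted file list using only the standard library. `expand_patterns` is a hypothetical helper, not part of pycmor; how the library itself opens and combines the matched files is not shown here.

```python
import glob
import os
import tempfile


def expand_patterns(input_patterns):
    """Return a sorted, de-duplicated list of files matching any pattern."""
    matches = set()
    for pattern in input_patterns:
        matches.update(glob.glob(pattern))
    return sorted(matches)


# Create a few empty placeholder files so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("tas_2000.nc", "tas_2001.nc", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    paths = expand_patterns([os.path.join(tmp, "*.nc")])
    found = [os.path.basename(path) for path in paths]

print(found)
```

Sorting the matches keeps files with date-stamped names in chronological order, which matters when a time series is spread across several files.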

pycmor.std_lib.set_global_attributes(data: DataArray | Dataset, rule: Rule) DataArray | Dataset[source]#

Set global metadata attributes for a Dataset or DataArray.

This function applies standardized global attributes to the Dataset or DataArray based on the specifications in the rule, following conventions like CMIP6.

Parameters:
  • data (xarray.DataArray or xarray.Dataset) – The data to which global attributes will be added.

  • rule (Rule) – The rule containing the global attribute specifications.

Returns:

The data with updated global attributes.

Return type:

xarray.DataArray or xarray.Dataset

pycmor.std_lib.set_variable_attributes(data: DataArray | Dataset, rule: Rule) DataArray | Dataset[source]#

Set variable-specific metadata attributes.

This function applies standardized variable attributes to the Dataset or DataArray based on the specifications in the rule, following conventions like CMIP6.

Parameters:
  • data (xarray.DataArray or xarray.Dataset) – The data to which variable attributes will be added.

  • rule (Rule) – The rule containing the variable attribute specifications.

Returns:

The data with updated variable attributes.

Return type:

xarray.DataArray or xarray.Dataset

pycmor.std_lib.show_data(data: DataArray | Dataset, rule: Rule) DataArray | Dataset[source]#

Print data to screen for inspection and debugging purposes.

This function is useful during development and debugging to inspect the content and structure of DataArrays and Datasets.

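Parameters:
  • data (xarray.DataArray or xarray.Dataset) – The data to print.

  • rule (Rule) – The rule for the current pipeline run.

Returns: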

The input data (unchanged).

Return type:

xarray.DataArray or xarray.Dataset

pycmor.std_lib.temporal_resample(data: DataArray | Dataset, rule: Rule) DataArray | Dataset[source]#

Resample a DataArray or Dataset to a different temporal frequency.

Parameters:
  • data (xarray.DataArray or xarray.Dataset) – The data to resample.

  • rule (Rule) – The rule containing parameters for the resampling operation, including the frequency for resampling.

Returns:

The resampled data.

Return type:

xarray.DataArray or xarray.Dataset

Notes

This function resamples time series data to a different frequency. The frequency is determined from the rule (typically from data_request_variable.frequency). Common frequencies include:

  • ‘YS’: year start

  • ‘MS’: month start

  • ‘D’: daily

  • ‘H’: hourly

See also

https://docs.xarray.dev/en/stable/user-guide/time-series.html#resampling-and-grouped-operations

pycmor.std_lib.time_average(data: DataArray | Dataset, rule: Rule) DataArray | Dataset[source]#

Compute the time average of a DataArray or Dataset based upon the Data Request Variable you have selected.

Parameters:
  • data (xarray.DataArray or xarray.Dataset) – The data to average.

  • rule (Rule) – The rule specifying parameters for time averaging, such as the time period or method to use for averaging.

Returns:

The averaged data.

Return type:

xarray.DataArray or xarray.Dataset

pycmor.std_lib.trigger_compute(data: DataArray | Dataset, rule: Rule) DataArray | Dataset[source]#

Trigger computation of lazy (dask-backed) data operations.

This function is useful to ensure that all pending computations are executed before proceeding with the next steps in a pipeline. It’s particularly important before saving data to files.

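Parameters:
  • data (xarray.DataArray or xarray.Dataset) – The lazy (dask-backed) data to compute.

  • rule (Rule) – The rule for the current pipeline run.

Returns: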

The computed data with all operations applied.

Return type:

xarray.DataArray or xarray.Dataset

Submodules#

pycmor.std_lib.dataset_helpers module#

pycmor.std_lib.dataset_helpers.freq_is_coarser_than_data(freq: str, ds: Dataset, ref_time: Timestamp = Timestamp('1970-01-01 00:00:00')) bool[source]#

Checks if the frequency is coarser than the time frequency of the xarray Dataset.

Parameters:
  • freq (str) – The frequency to compare (e.g. ‘M’, ‘D’, ‘6H’).

  • ds (xr.Dataset) – The dataset containing a time coordinate.

  • ref_time (pd.Timestamp, optional) – Reference timestamp used to convert frequency to a time delta. Defaults to the beginning of the Unix Epoch.

Returns:

True if freq is coarser (covers a longer duration) than the dataset’s frequency.

Return type:

bool

pycmor.std_lib.dataset_helpers.get_time_label(ds)[source]#

Determines the name of the coordinate in the dataset that can serve as a time label.

Parameters:

ds (xarray.Dataset) – The dataset containing coordinates to check for a time label.

Returns:

The name of the coordinate that is a datetime type and can serve as a time label, or None if no such coordinate is found.

Return type:

str or None

Example

>>> import xarray as xr
>>> import pandas as pd
>>> import numpy as np
>>> ds = xr.Dataset({'time': ('time', pd.date_range('2000-01-01', periods=10))})
>>> get_time_label(ds)
'time'
>>> da = xr.DataArray(np.ones(10), coords={'T': ('T', pd.date_range('2000-01-01', periods=10))})
>>> get_time_label(da)
'T'
>>> # The following does not have a valid time coordinate, so None is expected
>>> ds = xr.Dataset({'time': ('time', [1,2,3,4,5])})
>>> get_time_label(ds) is None
True

pycmor.std_lib.dataset_helpers.has_time_axis(ds) bool[source]#

Checks if the given dataset has a time axis.

Parameters:

ds (xarray.Dataset or xarray.DataArray) – The dataset to check.

Returns:

True if the dataset has a time axis, False otherwise.

Return type:

bool

pycmor.std_lib.dataset_helpers.is_datetime_type(arr: ndarray) bool[source]#

Checks whether the array elements are datetime objects or cftime objects.

pycmor.std_lib.dataset_helpers.needs_resampling(ds, timespan)[source]#

Checks if a given dataset needs resampling based on its time axis.

Parameters:
  • ds (xr.Dataset or xr.DataArray) – The dataset to check.

  • timespan (str) – The time span for which the dataset is to be resampled, for example 10YS, 1YS, or 6MS.

Returns:

True if the dataset needs resampling, False otherwise.

Return type:

bool

Notes

After time-averaging step, this function aids in determining if splitting into multiple files is required based on provided timespan.

pycmor.std_lib.exceptions module#

This module contains custom exceptions that you should raise when something specific goes wrong in the standard library.

exception pycmor.std_lib.exceptions.PycmorError[source]#

Bases: Exception

Base class for all errors raised by pycmor.

exception pycmor.std_lib.exceptions.PycmorResamplingError[source]#

Bases: PycmorError

Error raised when resampling fails.

exception pycmor.std_lib.exceptions.PycmorResamplingTimeAxisIncompatibilityError[source]#

Bases: PycmorResamplingError, ValueError

Error raised when resampling fails due to time axis incompatibility.
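Because PycmorResamplingTimeAxisIncompatibilityError derives from both PycmorResamplingError and ValueError, it can be caught at several levels of generality. The sketch below redefines the three classes locally, mirroring only the documented base classes, so that it runs without pycmor installed; `classify` is a hypothetical helper.

```python
class PycmorError(Exception):
    """Local stand-in mirroring the documented base class."""


class PycmorResamplingError(PycmorError):
    """Local stand-in: resampling failed."""


class PycmorResamplingTimeAxisIncompatibilityError(PycmorResamplingError, ValueError):
    """Local stand-in: resampling failed because of the time axis."""


def classify(exc):
    # List every handler type that would catch this exception.
    candidates = (PycmorError, PycmorResamplingError, ValueError)
    return [base.__name__ for base in candidates if isinstance(exc, base)]


caught_by = classify(PycmorResamplingTimeAxisIncompatibilityError("bad time axis"))
print(caught_by)
```

Code that already handles ValueError therefore keeps working, while code that wants to treat every pycmor failure alike can catch PycmorError.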

pycmor.std_lib.files module#

This module contains functions for handling file-related operations in the pycmor package. It includes functions for creating filepaths based on given rules and datasets, and for saving the resulting datasets to the generated filepaths.

Table 2: Precision of time labels used in file names

|---------------+-------------------+-----------------------------------------------|
| Frequency     | Precision of time | Notes                                         |
|               | label             |                                               |
|---------------+-------------------+-----------------------------------------------|
| yr, dec,      | “yyyy”            | Label with the years recorded in the first    |
| yrPt          |                   | and last coordinate values.                   |
|---------------+-------------------+-----------------------------------------------|
| mon, monC     | “yyyyMM”          | For “mon”, label with the months recorded in  |
|               |                   | the first and last coordinate values; for     |
|               |                   | “monC” label with the first and last months   |
|               |                   | contributing to the climatology.              |
|---------------+-------------------+-----------------------------------------------|
| day           | “yyyyMMdd”        | Label with the days recorded in the first and |
|               |                   | last coordinate values.                       |
|---------------+-------------------+-----------------------------------------------|
| 6hr, 3hr,     | “yyyyMMddhhmm”    | Label 1hrCM files with the beginning of the   |
| 1hr,          |                   | first hour and the end of the last hour       |
| 1hrCM, 6hrPt, |                   | contributing to climatology (rounded to the   |
| 3hrPt,        |                   | nearest minute); for other frequencies in     |
| 1hrPt         |                   | this category, label with the first and last  |
|               |                   | time-coordinate values (rounded to the        |
|               |                   | nearest minute).                              |
|---------------+-------------------+-----------------------------------------------|
| subhrPt       | “yyyyMMddhhmmss”  | Label with the first and last time-coordinate |
|               |                   | values (rounded to the nearest second)        |
|---------------+-------------------+-----------------------------------------------|
| fx            | Omit time label   | This frequency applies to variables that are  |
|               |                   | independent of time (“fixed”).                |
|---------------+-------------------+-----------------------------------------------|
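The precisions in the table above map directly onto strftime patterns. The following is a minimal sketch of that mapping, not the code behind _filename_time_range: `TIME_LABEL_FORMATS` and `time_range_label` are hypothetical names, joining the two labels with a hyphen is an assumption, and the rounding to the nearest minute or second described in the Notes column is left out.

```python
from datetime import datetime

# strftime patterns transcribed from the table; "fx" has no time label.
TIME_LABEL_FORMATS = {
    "yr": "%Y", "dec": "%Y", "yrPt": "%Y",
    "mon": "%Y%m", "monC": "%Y%m",
    "day": "%Y%m%d",
    "6hr": "%Y%m%d%H%M", "3hr": "%Y%m%d%H%M", "1hr": "%Y%m%d%H%M",
    "1hrCM": "%Y%m%d%H%M", "6hrPt": "%Y%m%d%H%M", "3hrPt": "%Y%m%d%H%M",
    "1hrPt": "%Y%m%d%H%M",
    "subhrPt": "%Y%m%d%H%M%S",
    "fx": None,
}


def time_range_label(frequency, first, last):
    """Return 'start-end' at the table's precision, or '' when no label applies."""
    fmt = TIME_LABEL_FORMATS[frequency]
    if fmt is None:
        return ""
    # Assumption: the two labels are joined with a hyphen.
    return f"{first.strftime(fmt)}-{last.strftime(fmt)}"


print(time_range_label("mon", datetime(2000, 1, 16), datetime(2009, 12, 16)))
print(time_range_label("day", datetime(2000, 1, 1), datetime(2000, 12, 31)))
```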

pycmor.std_lib.files._filename_time_range(ds, rule) str[source]#

Determine the time range used in naming the file.

Parameters:
  • ds (xarray.Dataset) – The input dataset.

  • rule (Rule) – The rule object containing information for generating the filepath.

Returns:

time_range in filepath.

Return type:

str

pycmor.std_lib.files._save_dataset_with_native_timespan(da, rule, time_label, time_encoding, **extra_kwargs)[source]#

pycmor.std_lib.files.create_filepath(ds, rule)[source]#

Generate a filepath when given an xarray dataset and a rule.

This function generates a filepath for the output file based on the given dataset and rule. The filepath includes the name, table_id, institution, source_id, experiment_id, label, grid, and optionally the start and end time.

Parameters:
  • ds (xarray.Dataset) – The input dataset.

  • rule (Rule) – The rule object containing information for generating the filepath.

Returns:

The generated filepath.

Return type:

str

Notes

The rule object should have the following attributes: cmor_variable, data_request_variable, variant_label, source_id, experiment_id, output_directory, and optionally institution.

pycmor.std_lib.files.file_timespan_tail(rule)[source]#

Grab the last timestamp in each file and return them as a list. Any offset defined on the rule is taken into account.

pycmor.std_lib.files.get_offset(rule)[source]#

Convert the offset defined on the rule to a timedelta.

pycmor.std_lib.files.save_dataset(da: DataArray, rule)[source]#

Save dataset to one or more files.

Parameters:
  • da (xr.DataArray) – The dataset to be saved.

  • rule (Rule) – The rule object containing information for generating the filepath.

Return type:

None

Notes

If the dataset does not have a time axis, or if the time axis is a scalar, this function will save the dataset to a single file. Otherwise, it will split the dataset into chunks based on the time axis and save each chunk to a separate file.

The filepath will be generated based on the rule object and the time range of the dataset. The filepath will include the name, table_id, institution, source_id, experiment_id, label, grid, and optionally the start and end time.

If the dataset needs resampling (i.e., the time axis does not align with the time frequency specified in the rule object), this function will split the dataset into chunks based on the time axis and resample each chunk to the specified frequency. The resampled chunks will then be saved to separate files.

NOTE: compute the data (for example with dask.compute()) before calling this function; otherwise the tasks will progress very slowly.

pycmor.std_lib.files.split_data_timespan(ds, rule)[source]#

Splits the dataset into chunks based on the time axis as defined in the source files.

Parameters:
  • ds (xarray.Dataset) – The dataset to split.

  • rule (Rule) – The rule object containing information for generating the filepath.

Returns:

A list of datasets, each containing a chunk of the original dataset.

Return type:

list

pycmor.std_lib.generic module#

Generic#

This module, generic.py, provides functionalities for transforming and standardizing NetCDF files according to CMOR.

It contains several functions and classes:

Functions (can be used as actions in Rule objects):

  • linear_transform: Applies a linear transformation to the data of a NetCDF file.

  • invert_z_axis: Inverts the z-axis of a NetCDF file.

The Full CMOR (yes, bad pun):
  • Applied if no other rule sets are given for a file

  • Adds CMOR metadata to the file

  • Converts units

  • Performs time averaging

pycmor.std_lib.generic.create_cmor_directories(config: dict) dict[source]#

Creates the directory structure for the CMORized files.

Parameters:

config (dict) – The pycmor configuration dictionary.

See also

https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk/edit

pycmor.std_lib.generic.dummy_load_data(data, rule_spec, *args, **kwargs)[source]#

A dummy function for testing. Loads the xarray tutorial data.

pycmor.std_lib.generic.dummy_logic_step(data, rule_spec, *args, **kwargs)[source]#

A dummy function for testing. Prints data to screen and adds a dummy attribute to the data.

pycmor.std_lib.generic.dummy_save_data(data, rule_spec, *args, **kwargs)[source]#

A dummy function for testing. Saves the data to a netcdf file.

pycmor.std_lib.generic.dummy_sleep(data, rule_spec, *arg, **kwargs)[source]#

A dummy function for testing. Sleeps for 5 seconds.

pycmor.std_lib.generic.get_variable(data, rule_spec, *args, **kwargs)[source]#

Gets a particular variable out of a xr.Dataset

Parameters:
  • data (xr.Dataset) – Assumed to be a Dataset already; no check is performed.

  • rule_spec (Rule) – Rule describing the DataRequestVariable for this pipeline run

Return type:

xr.DataArray

pycmor.std_lib.generic.invert_z_axis(filepath: Path, execute: bool = False, flip_sign: bool = False)[source]#

Inverts the z-axis of a NetCDF file.

Parameters:
  • filepath (Path) – Path to the input file.

  • execute (bool, optional) – If True, the function will execute the inversion. If False, it will only print the changes that would be made.

pycmor.std_lib.generic.linear_transform(filepath: Path, execute: bool = False, slope: float = 1, offset: float = 0)[source]#

Applies a linear transformation to the data of a NetCDF file.

Parameters:
  • filepath (Path) – Path to the input file.

  • execute (bool, optional)

  • slope (float, optional)

  • offset (float, optional)

pycmor.std_lib.generic.load_data(data, rule_spec, *args, **kwargs)[source]#

Loads data described by the rule_spec.

pycmor.std_lib.generic.multiyear_monthly_mean(data, rule_spec, *args, **kwargs)[source]#

pycmor.std_lib.generic.rename_dims(data, rule_spec)[source]#

Renames the dimensions of the array based on the key/values of rule_spec["model_dim"].

pycmor.std_lib.generic.resample_monthly(data, rule_spec, *args, **kwargs)[source]#

Monthly means per year.

pycmor.std_lib.generic.resample_yearly(data, rule_spec, *args, **kwargs)[source]#

Yearly means.

pycmor.std_lib.generic.show_data(data, rule_spec, *args, **kwargs)[source]#

Prints data to screen. Useful for debugging.

pycmor.std_lib.generic.sort_dimensions(data, rule_spec)[source]#

Sorts the dimensions of a DataArray based on the array_order attribute of the rule_spec. If the array_order attribute is not present, it is inferred from the dimensions attribute of the data request variable.

pycmor.std_lib.generic.trigger_compute(data, rule_spec, *args, **kwargs)[source]#

pycmor.std_lib.global_attributes module#

class pycmor.std_lib.global_attributes.CMIP6GlobalAttributes(drv, cv, rule_dict)[source]#

Bases: GlobalAttributes

_registry = {}#
_variant_label_components(label: str)[source]#
get_Conventions()[source]#
get_activity_id()[source]#
get_creation_date()[source]#
get_data_specs_version()[source]#
get_experiment()[source]#
get_experiment_id()[source]#
get_forcing_index()[source]#
get_frequency()[source]#
get_further_info_url()[source]#
get_grid()[source]#
get_grid_label()[source]#
get_initialization_index()[source]#
get_institution()[source]#
get_institution_id()[source]#
get_license()[source]#
get_mip_era()[source]#
get_nominal_resolution()[source]#
get_physics_index()[source]#
get_product()[source]#
get_realization_index()[source]#
get_realm()[source]#
get_source()[source]#
get_source_id()[source]#
get_source_type()[source]#
get_sub_experiment()[source]#
get_sub_experiment_id()[source]#
get_table_id()[source]#
get_tracking_id()[source]#
get_variable_id()[source]#
get_variant_label()[source]#
global_attributes() dict[source]#
property required_global_attributes#
subdir_path() str[source]#
class pycmor.std_lib.global_attributes.CMIP7GlobalAttributes[source]#

Bases: GlobalAttributes

_registry = {}#
global_attributes()[source]#
subdir_path()[source]#
class pycmor.std_lib.global_attributes.GlobalAttributes[source]#

Bases: object

_registry = {'CMIP6': <class 'pycmor.std_lib.global_attributes.CMIP6GlobalAttributes'>, 'CMIP7': <class 'pycmor.std_lib.global_attributes.CMIP7GlobalAttributes'>}#
abstractmethod global_attributes()[source]#
abstractmethod subdir_path()[source]#
pycmor.std_lib.global_attributes.set_global_attributes(ds, rule)[source]#

Set global attributes for the dataset
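CMIP6 variant labels have the form r<k>i<l>p<m>f<n>, and CMIP6GlobalAttributes exposes one getter per index (get_realization_index, get_initialization_index, get_physics_index, get_forcing_index). The sketch below shows one way such a label could be split into its four indices. It is not the implementation of _variant_label_components: `VARIANT_LABEL`, `variant_label_components`, and the returned dictionary layout are assumptions made for illustration.

```python
import re

# r<k>i<l>p<m>f<n>, each index a positive integer written in decimal.
VARIANT_LABEL = re.compile(
    r"^r(?P<realization_index>\d+)"
    r"i(?P<initialization_index>\d+)"
    r"p(?P<physics_index>\d+)"
    r"f(?P<forcing_index>\d+)$"
)


def variant_label_components(label):
    """Split a variant label into its four integer indices."""
    match = VARIANT_LABEL.match(label)
    if match is None:
        raise ValueError(f"not a CMIP6 variant label: {label!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


components = variant_label_components("r1i1p1f1")
print(components)
```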

pycmor.std_lib.setgrid module#

Set grid information on the data file.

Unlike cdo, xarray has no built-in setgrid operator, and merging the grid with the data directly via xarray.merge does not always produce the desired result.

Some guiding rules to set the grid information:

  1. At least one dimension size in both data file and grid file should match.

  2. If the dimension sizes match but the dimension names do not, then the dimension in the data file is renamed to match the dimension name in the grid file.

  3. The dimension with the matching size must be a coordinate variable in both the data file and the grid file.

  4. If all above conditions are met, then the data file is merged with the grid file.

  5. The coordinate variables and boundary variables (lat_bnds, lon_bnds) from the grid file are kept, while other data variables in the grid file are dropped.

  6. The result of the merge is always an xarray.Dataset.

Note: Rule 5 is not strict and may go away if it is not desired.
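Rules 1 and 2 above amount to matching dimensions by size and then renaming. A minimal sketch of that matching step follows, using plain {name: size} dictionaries in place of xarray objects. `match_dims_by_size` and the dimension names are hypothetical, and the coordinate-variable check from rule 3 is not modelled.

```python
def match_dims_by_size(data_dims, grid_dims):
    """Map data dimension names to grid dimension names that share a size.

    Both arguments are {dimension_name: size} dictionaries.
    """
    renames = {}
    for data_name, size in data_dims.items():
        for grid_name, grid_size in grid_dims.items():
            # Use each grid dimension at most once.
            if size == grid_size and grid_name not in renames.values():
                renames[data_name] = grid_name
                break
    return renames


data_dims = {"time": 12, "ncells": 3140}
grid_dims = {"nod2": 3140, "nv": 3}
renames = match_dims_by_size(data_dims, grid_dims)
print(renames)
```

An empty result corresponds to rule 1 failing, in which case no merge would be attempted. An entry whose key and value are identical means the names already agree and no rename is needed.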

pycmor.std_lib.setgrid.setgrid(da: Dataset | DataArray, rule: Rule) Dataset | DataArray[source]#

Appends grid information to the data file if the necessary coordinate dimensions exist in the data file. Renames dimensions in the data file to match the dimension names in the grid file if necessary.

Parameters:
  • da (xr.Dataset or xr.DataArray) – The input dataarray or dataset.

  • rule (Rule) – The rule object; it must carry a gridfile attribute.

Returns:

The output dataarray or dataset with the grid information.

Return type:

xr.Dataset

pycmor.std_lib.timeaverage module#

Time Averaging#

This module contains functions for time averaging of data arrays.

The approximate interval for time averaging is prescribed in the CMOR tables, using the key 'approx_interval'. This information is also provided within the library.

Functions#

_get_time_method(frequency: str) -> str:

Determine the time method based on the frequency string from rule.data_request_variable.frequency.

_frequency_from_approx_interval(interval: str) -> str:

Convert an interval expressed in days to a frequency string.

timeavg(da: xr.DataArray, rule: Dict) -> xr.DataArray:

Time averages data with respect to time-method (mean/climatology/instant.)

Module Variables#

_IGNORED_CELL_METHODS : list

List of cell_methods to ignore when calculating time averages.

pycmor.std_lib.timeaverage._IGNORED_CELL_METHODS = ['area: depth: time: mean', 'area: mean', 'area: mean (comment: over land and sea ice) time: point', 'area: mean time: maximum', 'area: mean time: maximum within days time: mean over days', 'area: mean time: mean within days time: mean over days', 'area: mean time: mean within hours time: maximum over hours', 'area: mean time: mean within years time: mean over years', 'area: mean time: minimum', 'area: mean time: minimum within days time: mean over days', 'area: mean time: point', 'area: mean time: sum', 'area: mean where crops time: maximum', 'area: mean where crops time: maximum within days time: mean over days', 'area: mean where crops time: minimum', 'area: mean where crops time: minimum within days time: mean over days', 'area: mean where grounded_ice_sheet', 'area: mean where ice_free_sea over sea time: mean', 'area: mean where ice_sheet', 'area: mean where land', 'area: mean where land over all_area_types time: mean', 'area: mean where land over all_area_types time: point', 'area: mean where land over all_area_types time: sum', 'area: mean where land time: mean', 'area: mean where land time: mean (with samples weighted by snow mass)', 'area: mean where land time: point', 'area: mean where sea', 'area: mean where sea depth: sum where sea (top 100m only) time: mean', 'area: mean where sea depth: sum where sea time: mean', 'area: mean where sea time: mean', 'area: mean where sea time: point', 'area: mean where sea_ice (comment: mask=siconc) time: point', 'area: mean where sector time: point', 'area: mean where snow over sea_ice area: time: mean where sea_ice', 'area: point', 'area: point time: point', 'area: sum', 'area: sum where ice_sheet time: mean', 'area: sum where sea time: mean', 'area: time: mean', 'area: time: mean (comment: over land and sea ice)', 'area: time: mean where cloud', 'area: time: mean where crops (comment: mask=cropFrac)', 'area: time: mean where floating_ice_shelf (comment: mask=sftflf)', 'area: time: mean where grounded_ice_sheet (comment: mask=sfgrlf)', 'area: time: mean where ice_sheet', 'area: time: mean where natural_grasses (comment: mask=grassFrac)', 'area: time: mean where pastures (comment: mask=pastureFrac)', 'area: time: mean where sea_ice (comment: mask=siconc)', 'area: time: mean where sea_ice (comment: mask=siconca)', 'area: time: mean where sea_ice (comment: mask=siitdconc)', 'area: time: mean where sea_ice_melt_pond (comment: mask=simpconc)', 'area: time: mean where sea_ice_ridges (comment: mask=sirdgconc)', 'area: time: mean where sector', 'area: time: mean where shrubs (comment: mask=shrubFrac)', 'area: time: mean where snow (comment: mask=snc)', 'area: time: mean where trees (comment: mask=treeFrac)', 'area: time: mean where unfrozen_soil', 'area: time: mean where vegetation (comment: mask=vegFrac)', 'longitude: mean time: mean', 'longitude: mean time: point', 'longitude: sum (comment: basin sum [along zig-zag grid path]) depth: sum time: mean', 'time: mean', 'time: mean grid_longitude: mean', 'time: point']#

cell_methods to ignore when calculating time averages

Type:

list

pycmor.std_lib.timeaverage._frequency_from_approx_interval(interval: str)[source]#

Convert an interval expressed in days to a frequency string.

This function takes an interval expressed in days and converts it to a frequency string in a suitable time unit (decade, year, month, day, hour, minute, second, millisecond). The conversion is based on an approximate number of days for each time unit.

Parameters:

interval (str) – The interval expressed in days.

Returns:

The frequency string in a suitable time unit.

Return type:

str

Raises:

ValueError – If the interval cannot be converted to a float.
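To make the idea of picking a suitable time unit concrete, the sketch below chooses the largest unit whose approximate length fits into the interval at least once. It is an illustration under stated assumptions rather than the library's code: the day counts per unit (30 per month, 365 per year, 3650 per decade), the tolerance, the name `interval_to_count_and_unit`, and the (count, unit) return value are all hypothetical, and the format of the real frequency string is not reproduced.

```python
# Approximate unit lengths in days, largest first (assumed values).
APPROX_UNIT_DAYS = [
    ("decade", 3650.0),
    ("year", 365.0),
    ("month", 30.0),
    ("day", 1.0),
    ("hour", 1 / 24),
    ("minute", 1 / 1440),
    ("second", 1 / 86400),
    ("millisecond", 1 / 86_400_000),
]


def interval_to_count_and_unit(interval, tolerance=1e-3):
    """Return (count, unit) for an interval given in days as a string."""
    days = float(interval)  # raises ValueError if the string is not numeric
    if days <= 0:
        raise ValueError(f"interval must be positive, got {interval!r}")
    for unit, unit_days in APPROX_UNIT_DAYS:
        count = days / unit_days
        # Accept the first (largest) unit that fits at least once.
        if count >= 1 - tolerance:
            return round(count), unit
    # Shorter than one millisecond: report it in the smallest unit.
    return 1, "millisecond"


print(interval_to_count_and_unit("30.0"))
print(interval_to_count_and_unit("0.125"))
```

The tolerance lets rounded inputs such as "0.041667" (one hour expressed in days) resolve to one hour instead of falling through to minutes.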

pycmor.std_lib.timeaverage._get_time_method(frequency: str) str[source]#

Determine the time method based on the frequency string from CMIP6 table for a specific variable (rule.data_request_variable.frequency).

The type of time method influences how the data is processed for time averaging.

Parameters:

frequency (str) – The frequency string from CMIP6 tables (example: “mon”).

Returns:

The corresponding time method (‘INSTANTANEOUS’, ‘CLIMATOLOGY’, or ‘MEAN’).

Return type:

str
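CMIP6 frequency names carry the sampling type in their suffix: the names ending in "Pt" (such as 3hrPt) denote point-in-time samples, and monC and 1hrCM denote climatologies. One plausible way to derive the three time methods from that suffix is sketched below. `time_method_for` is a hypothetical name, and the suffix rule is an assumption based on the frequency names, not a transcription of _get_time_method.

```python
def time_method_for(frequency):
    """Guess the time method from the suffix of a CMIP6 frequency name."""
    if frequency.endswith("Pt"):
        # Point-in-time samples, e.g. "3hrPt" or "yrPt".
        return "INSTANTANEOUS"
    if frequency.endswith(("C", "CM")):
        # Climatologies, e.g. "monC" or "1hrCM".
        return "CLIMATOLOGY"
    # Everything else is treated as a plain mean over the interval.
    return "MEAN"


for name in ("mon", "3hrPt", "monC", "1hrCM"):
    print(name, time_method_for(name))
```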

pycmor.std_lib.timeaverage.custom_resample(df, freq='M', offset=0.5, func='mean')[source]#

Resample a DataFrame and place timestamps at a custom offset within each period.

Parameters:
  • df (DataFrame) – DataFrame with a DatetimeIndex

  • freq (str) – Frequency string (e.g., ‘M’ for month, ‘Y’ for year)

  • offset (float) – Float between 0 and 1, representing the position within each period

  • func (str) – Resampling function (e.g., ‘mean’, ‘sum’, ‘max’)

Returns:

Resampled DataFrame with adjusted timestamps

Return type:

DataFrame

Examples

First, set up our imports and random seed:

>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(42)
>>> date_rng = pd.date_range(start="2023-01-01", end="2023-12-31", freq="D")
>>> df = pd.DataFrame({"value": rng.random(len(date_rng))}, index=date_rng)

Test mid-month resampling:

>>> df_month_mid = custom_resample(df, freq="ME", offset=0.5)
>>> print(df_month_mid.head())
                        value
2023-01-16 00:00:00  0.565127
2023-02-14 12:00:00  0.484111
2023-03-16 00:00:00  0.434221
2023-04-15 12:00:00  0.510354
2023-05-16 00:00:00  0.443399

Test mid-year resampling:

>>> df_year_mid = custom_resample(df, freq="YE", offset=0.5)
>>> print(df_year_mid)
               value
2023-07-02  0.492457

Test mid-week resampling:

>>> df_week_mid = custom_resample(df, freq="W", offset=0.5)
>>> print(df_week_mid.head())
               value
2023-01-01  0.773956
2023-01-05  0.658835
2023-01-12  0.540872
2023-01-19  0.488221
2023-01-26  0.500237

Test one-third through each month:

>>> df_month_third = custom_resample(df, freq="ME", offset=1/3)
>>> print(df_month_third.head())
                        value
2023-01-11 00:00:00  0.565127
2023-02-10 00:00:00  0.484111
2023-03-11 00:00:00  0.434221
2023-04-10 16:00:00  0.510354
2023-05-11 00:00:00  0.443399

Test quarter-end resampling:

>>> df_quarter_end = custom_resample(df, freq="QE", offset=1)
>>> print(df_quarter_end)
               value
2023-03-31  0.494832
2023-06-30  0.496207
2023-09-30  0.461806
2023-12-31  0.517077

Test with irregular time series:

>>> irregular_dates = pd.date_range("2023-01-01", periods=100, freq="D").tolist()
>>> irregular_dates += pd.date_range("2023-05-01", periods=50, freq="2D").tolist()
>>> irregular_dates += pd.date_range("2023-07-01", periods=30, freq="3D").tolist()
>>> df_irregular = pd.DataFrame({"value": rng.random(len(irregular_dates))}, index=irregular_dates)
>>> df_irregular_month = custom_resample(df_irregular, freq="ME", offset=0.5)
>>> print(df_irregular_month.head())
                        value
2023-01-16 00:00:00  0.543549
2023-02-14 12:00:00  0.485275
2023-03-16 00:00:00  0.513365
2023-04-05 12:00:00  0.558554
2023-05-16 00:00:00  0.447175

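The timestamps printed in the examples above are consistent with linear interpolation between the first and last timestamp present in each period: for January, 2023-01-01 plus half of the 30 days up to 2023-01-31 gives 2023-01-16. This is inferred from the printed output, not from the implementation. The standard-library sketch below reproduces that arithmetic; `position_in_period` is a hypothetical helper.

```python
from datetime import datetime


def position_in_period(first, last, offset):
    """Return the timestamp `offset` of the way from `first` to `last`."""
    if not 0 <= offset <= 1:
        raise ValueError("offset must lie between 0 and 1")
    # timedelta supports multiplication by a float.
    return first + offset * (last - first)


# Mid-month and one-third positions for daily data in 2023.
print(position_in_period(datetime(2023, 1, 1), datetime(2023, 1, 31), 0.5))
print(position_in_period(datetime(2023, 2, 1), datetime(2023, 2, 28), 0.5))
print(position_in_period(datetime(2023, 4, 1), datetime(2023, 4, 30), 1 / 3))
```

The same arithmetic accounts for the irregular series, where the April data end on 2023-04-10 and the mid-point lands on 2023-04-05 12:00:00.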
pycmor.std_lib.timeaverage.timeavg(da: DataArray, rule)[source]#

Time averages data with respect to time-method (mean/climatology/instant.)

This function takes a data array and a rule, computes the timespan of the data array, and then performs time averaging based on the time method specified in the rule. The time methods can be "INSTANTANEOUS", "MEAN", or "CLIMATOLOGY".

For "MEAN" time method, the timestamps can be adjusted using the adjust_timestamp parameter in the rule dict.

This can be either:

  • A float between 0 and 1 representing the position within each period (e.g., 0.5 for mid-point)

  • A string preset: “first”/“start” (0.0), “last”/“end” (1.0), “mid”/“middle” (0.5)

  • A pandas offset string (e.g., “2d” for 2 days offset)

This feature is useful for setting consistent mid-month dates by setting adjust_timestamp to “14d”.
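The three accepted forms of adjust_timestamp can be told apart with a small dispatcher. The sketch below is an assumption-laden illustration, not the parsing done by timeavg: `PRESETS`, `interpret_adjust_timestamp`, and the ("fraction" or "offset", value) return convention are hypothetical, and only whole-day offset strings such as "14d" are recognised, whereas real pandas offset strings cover many more units.

```python
import re
from datetime import timedelta

# Preset names and their positions within a period, as listed above.
PRESETS = {
    "first": 0.0, "start": 0.0,
    "mid": 0.5, "middle": 0.5,
    "last": 1.0, "end": 1.0,
}


def interpret_adjust_timestamp(value):
    """Classify a setting as ('fraction', float) or ('offset', timedelta)."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if not 0 <= value <= 1:
            raise ValueError("a numeric adjust_timestamp must lie in [0, 1]")
        return ("fraction", float(value))
    text = str(value).strip().lower()
    if text in PRESETS:
        return ("fraction", PRESETS[text])
    # Simplification: accept only whole-day offsets such as "14d".
    match = re.fullmatch(r"(\d+)d", text)
    if match:
        return ("offset", timedelta(days=int(match.group(1))))
    raise ValueError(f"unrecognised adjust_timestamp: {value!r}")


print(interpret_adjust_timestamp(0.5))
print(interpret_adjust_timestamp("mid"))
print(interpret_adjust_timestamp("14d"))
```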

Parameters:
  • da (xr.DataArray) – The data array to compute the timespan for.

  • rule (dict) – The rule dict containing the time method and other parameters. For “MEAN” time method, can include ‘adjust_timestamp’ to control timestamp positioning.

Returns:

The time averaged data array.

Return type:

xr.DataArray

pycmor.std_lib.units module#

This module deals with the auto-unit conversion in the cmorization process. In case the units in model files differ from CMIP Tables, this module attempts to convert them automatically.

Conversion to-or-from a dimensionless quantity is ambiguous. In this case, provide a mapping of what this dimensionless quantity represents and that is used for the conversion. data/dimensionless_mappings.yaml contains some examples on how the mapping is written.

handle_unit_conversion() is the only function users care about as it handles the unit conversion of an xr.DataArray according to a Rule. The rest of the functions in this module are support functions.

pycmor.std_lib.units._get_units(da: DataArray, rule: Rule) tuple[str, str, str][source]#

Get the units from a DataArray and a Rule.

This function extracts the units from a DataArray and a Rule. If the Rule contains a model_units entry, this takes precedence over the units defined in the dataset. The function also handles dimensionless units by looking up a unit alias in the dimensionless_unit_mappings dictionary of the Rule.

Parameters:
  • da (xarray.DataArray) – The DataArray to extract the units from.

  • rule (dict) – The Rule to extract the units from.

Returns:

  • from_unit (str) – The unit of the DataArray.

  • to_unit (str) – The unit to convert the DataArray to.

  • to_unit_dimensionless_mapping (str) – The unit alias used for representing the to_unit.

pycmor.std_lib.units.convert(da: DataArray, from_unit: str, to_unit: str, to_unit_dimensionless_mapping: str | None = None) DataArray[source]#

Convert a DataArray from one unit to another.

This function handles the conversion of a xarray.DataArray from one unit to another, taking into account chemical symbols and scaling factor in units. It uses the pint library for unit conversion and supports aliasing of target units.

Parameters:
  • da (xarray.DataArray) – The DataArray to be converted.

  • from_unit (str) – The unit of the input DataArray.

  • to_unit (str) – The unit to convert the DataArray to.

  • to_unit_dimensionless_mapping (str, optional) – An alias for the target unit, if any. Defaults to None.

Returns:

The converted DataArray with the new unit.

Return type:

xarray.DataArray

Raises:

ValueError – If the conversion between the specified units is not possible.

pycmor.std_lib.units.handle_chemicals(s: str | None = None, pattern: Pattern = re.compile('mol(?P<symbol>\\w+)')) None[source]#

Handle units containing chemical symbols.

If the unit string contains a chemical symbol (e.g. molNaCl), Pint will raise an error because it does not know the definition of the chemical symbol. This function attempts to detect chemical symbols in the unit string and register a unit definition for it with the aid of chemicals package.

Parameters:
  • s (str) – The unit string to parse.

  • pattern (re.Pattern, optional) – The regular expression pattern to use for searching for chemical symbols in the unit string. Defaults to a pattern that matches “mol” followed by one or more word characters.

Return type:

None

Raises:

ValueError – If the chemical symbol is not recognized.

See also

periodic_table – Periodic table of elements.

re.compile – Python’s regex syntax.
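The default pattern shown in the signature can be tried on its own to see which unit strings would be treated as carrying a chemical symbol. The sketch below uses only the standard re module and the pattern exactly as documented; `find_chemical_symbol` is a hypothetical helper, and the step of registering a unit definition with Pint is not shown.

```python
import re

# Default pattern copied from the handle_chemicals signature.
PATTERN = re.compile(r"mol(?P<symbol>\w+)")


def find_chemical_symbol(unit):
    """Return the text captured after 'mol', or None if nothing matches."""
    match = PATTERN.search(unit)
    return match.group("symbol") if match else None


for unit in ("molNaCl", "mmolC m-2 s-1", "mol m-3", "kg m-2"):
    print(repr(unit), "->", find_chemical_symbol(unit))
```

Since \w+ needs at least one word character directly after "mol", a plain "mol m-3" does not match and would be left for Pint to handle as an ordinary unit.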

pycmor.std_lib.units.handle_scalar_units(da: DataArray, from_unit: str, to: str) DataArray[source]#

Convert a DataArray with scalar units from one unit to another.

This function handles the conversion of a xarray.DataArray containing scalar units to another unit. The function uses the pint library for unit conversion. If the initial quantification fails due to an undefined unit, it attempts to assign and quantify the unit manually.

Parameters:
  • da (xarray.DataArray) – The DataArray to be converted.

  • from_unit (str) – The unit of the input DataArray.

  • to (str) – The unit to convert the DataArray to.

Returns:

The converted DataArray with the new unit.

Return type:

xarray.DataArray

Raises:

ValueError – If the conversion between the specified units is not possible.

pycmor.std_lib.units.handle_unit_conversion(da: DataArray, rule: Rule) DataArray[source]#

Handle unit conversion of a DataArray according to a Rule.

This function applies the necessary unit conversion to a DataArray based on the units defined in the Rule. It takes into account user-defined units, chemical symbols and dimensionless units.

Parameters:
  • da (xarray.DataArray) – The DataArray to be converted.

  • rule (dict) – The Rule containing the units to convert to.

Returns:

The converted DataArray with the new unit.

Return type:

xarray.DataArray

pycmor.std_lib.variable_attributes module#

Pipeline steps to attach metadata attributes to xarray objects.

pycmor.std_lib.variable_attributes.set_variable_attributes(ds: Dataset | DataArray, rule: Rule) Dataset | DataArray#
pycmor.std_lib.variable_attributes.set_variable_attrs(ds: Dataset | DataArray, rule: Rule) Dataset | DataArray[source]#