Architecture and Processing Flow

This page describes the contributor-facing architecture of irdl and the conceptual flow of a Dataset.get(...) call. We focus on the extension points rather than every private implementation detail.

Core architecture

BaseDataset

The core abstraction for all Datasets (see BaseDataset). It defines the shared get pipeline: common parameter validation, cache path handling, retrieval/process orchestration, SOFA verification, and output conversion.

Optional Dataset Family classes

When multiple Datasets share dataset-specific steps in the get pipeline (see get processing flow), a Dataset Family class can be introduced to lift that shared logic into an intermediate base class. Do this only when the behaviour is genuinely shared; otherwise keep it in the individual Dataset class.

Individual Dataset classes

Each Dataset is implemented as an individual class. Its typed get() classmethod delegates to the shared pipeline, while the class itself implements that pipeline’s dataset-specific steps (see get processing flow). The typed signature and NumPy-style docstring are also used to generate CLI parameters and help text automatically.
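As a rough illustration of how a typed signature can drive CLI generation, the sketch below uses only the standard library. The get function, its parameters, and build_parser are hypothetical stand-ins; irdl's own generator may work differently, and it also reads help text from the docstring, which this sketch omits.

```python
import argparse
import inspect


def get(subject: int, sampling_rate: int = 48000, output_format: str = "sofa"):
    """Hypothetical stand-in for a Dataset's typed get() signature."""
    return subject, sampling_rate, output_format


def build_parser(func) -> argparse.ArgumentParser:
    """Derive one CLI option per parameter of ``func`` (illustrative only)."""
    parser = argparse.ArgumentParser(prog=func.__name__)
    for name, param in inspect.signature(func).parameters.items():
        # Parameters without a default become required options.
        required = param.default is inspect.Parameter.empty
        parser.add_argument(
            f"--{name.replace('_', '-')}",
            type=param.annotation,  # the annotation doubles as the converter
            required=required,
            default=None if required else param.default,
        )
    return parser


parser = build_parser(get)
args = parser.parse_args(["--subject", "3", "--output-format", "raw"])
```

Because the parser is derived from the signature, adding a typed parameter to get() would surface it on the command line without a separate CLI definition.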

Support modules

Retrieval/repository helpers, CLI generation, logging/progress helpers, and small utility functions live outside the Dataset classes.

Class hierarchy:

BaseDataset
└── Optional Dataset Family class
    └── Individual Dataset class
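
A minimal sketch of this hierarchy in Python, assuming invented class names and an invented shared helper. Only the name BaseDataset comes from this page, and the class shown here is a stand-in rather than irdl's real implementation.

```python
class BaseDataset:
    """Stand-in for the shared base class (not irdl's real implementation)."""

    @classmethod
    def describe(cls) -> str:
        # Walk the MRO to show the chain from the concrete class up to the base.
        chain = [c.__name__ for c in cls.__mro__ if c is not object]
        return " -> ".join(chain)


class ExampleFamily(BaseDataset):
    """Hypothetical Dataset Family: holds a step shared by several Datasets."""

    @staticmethod
    def _normalise_subject(subject: int) -> str:
        # Shared naming rule, lifted here because both Datasets below need it.
        return f"subject_{subject:03d}"


class ExampleDatasetA(ExampleFamily):
    """Hypothetical individual Dataset that relies on the family helper."""


class ExampleDatasetB(ExampleFamily):
    """Second hypothetical Dataset sharing the same helper."""


class StandaloneDataset(BaseDataset):
    """Hypothetical Dataset with nothing to share: derives from the base directly."""
```

The family layer exists only because two Datasets need the same helper; StandaloneDataset skips it entirely.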

Cache stages

irdl uses three canonical Cache Stages:

  1. provider

    Files exactly as the Dataset Provider delivers them.

  2. ingest

    The single ingest-ready file that irdl can read into the internal SOFA representation.

  3. output

    Files produced by converting the internal SOFA representation to disk-based Output Formats.

Use these names in code comments and documentation. “Ingest-ready” is an adjective for a file in the ingest stage, not a separate stage name.
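
One way to keep the three stage names consistent in code is to define them once and derive paths from them. The sketch below is an assumption for illustration: CacheStage, stage_dir, and the cache_root/dataset/stage directory layout are hypothetical and not taken from irdl.

```python
from enum import Enum
from pathlib import Path


class CacheStage(str, Enum):
    """The three canonical stage names, in pipeline order."""

    PROVIDER = "provider"
    INGEST = "ingest"
    OUTPUT = "output"


def stage_dir(cache_root: Path, dataset: str, stage: CacheStage) -> Path:
    """Return <cache_root>/<dataset>/<stage> (assumed layout, for illustration)."""
    return cache_root / dataset / stage.value


root = Path("cache")
paths = {stage.value: stage_dir(root, "example", stage) for stage in CacheStage}
```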

get processing flow

A public Dataset.get(...) call delegates to the shared BaseDataset flow:

Dataset.get(...)
  └─ BaseDataset._get(...)
      ├─ validate common and Dataset-specific parameters
      ├─ resolve provider / ingest / output paths
      ├─ [optional] raw output: retrieve provider artifact and return it
      ├─ reuse cached output if available
      ├─ reuse ingest file if available
      ├─ retrieve provider artifact if needed
      ├─ [optional] process provider file(s) into ingest file
      ├─ ingest to internal SOFA representation
      ├─ verify and upgrade SOFA convention
      └─ convert SOFA → requested Output Format

Each Dataset has the following extension points:

_validate_params()

Mandatory. Validate Dataset-specific parameters and reject invalid parameter combinations.

_source_filename()

Mandatory. Return the canonical basename for the ingest-ready file.

_download()

Mandatory. Acquire provider-stage file(s) and return the primary provider artifact.

_process()

Optional. Transform provider-stage files into the single ingest-ready file. The default implementation performs no transformation; it only promotes a single provider file to the ingest stage.

_ingest()

Mandatory. Read the ingest-ready file and return the internal SOFA representation.

_to_output()

Optional. Conversion methods normally stay in BaseDataset. New Datasets should not implement output-specific conversion unless the shared conversion layer itself needs to change.
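
The extension points can be illustrated with a toy pipeline. This is a simplified sketch, not irdl's code: ToyBase, ToyDataset, the method signatures, and the file names are all assumptions. The output cache, SOFA verification, and output conversion are left out, and a plain dict stands in for the internal SOFA representation.

```python
import tempfile
from pathlib import Path


class ToyBase:
    """Toy stand-in for the shared pipeline; step order follows the flow above."""

    @classmethod
    def _get(cls, cache_root: Path, output_format: str = "dict", **params):
        cls._validate_params(**params)
        provider_dir = cache_root / "provider"
        ingest_file = cache_root / "ingest" / cls._source_filename(**params)
        if output_format == "raw":
            # Raw output: hand back the provider artifact untouched.
            return cls._download(provider_dir, **params)
        if not ingest_file.exists():
            # Nothing to reuse: retrieve, then promote or process into ingest.
            artifact = cls._download(provider_dir, **params)
            cls._process(artifact, ingest_file, **params)
        return cls._ingest(ingest_file)

    @classmethod
    def _process(cls, artifact: Path, ingest_file: Path, **params) -> None:
        # Default: promote the single provider file to the ingest stage.
        ingest_file.parent.mkdir(parents=True, exist_ok=True)
        ingest_file.write_bytes(artifact.read_bytes())


class ToyDataset(ToyBase):
    downloads = 0  # counts retrievals so cache reuse is observable

    @classmethod
    def _validate_params(cls, subject: int) -> None:
        if subject < 1:
            raise ValueError("subject must be >= 1")

    @classmethod
    def _source_filename(cls, subject: int) -> str:
        return f"subject_{subject:03d}.txt"

    @classmethod
    def _download(cls, provider_dir: Path, subject: int) -> Path:
        # A real Dataset would fetch from the Dataset Provider here.
        cls.downloads += 1
        provider_dir.mkdir(parents=True, exist_ok=True)
        target = provider_dir / f"raw_{subject}.txt"
        target.write_text(f"data for subject {subject}")
        return target

    @classmethod
    def _ingest(cls, ingest_file: Path) -> dict:
        # A real Dataset would return the internal SOFA representation.
        return {"source": ingest_file.name, "payload": ingest_file.read_text()}


with tempfile.TemporaryDirectory() as tmp:
    first = ToyDataset._get(Path(tmp), subject=1)
    second = ToyDataset._get(Path(tmp), subject=1)  # reuses the ingest file
```

ToyDataset does not override _process, so it inherits the default single-file promotion; the second call finds the ingest file and skips retrieval.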

Output behavior

output_format="raw" returns the provider-stage artifact before irdl processing. For all other Output Formats, irdl ingests the data to SOFA first and then converts from SOFA to the requested representation. See the public class docs in Python API.

When export_dir is provided, irdl copies the requested artifact to the Export Directory. The cache remains intact so later calls can reuse provider, ingest, or output artifacts.
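
The copy-not-move behaviour can be sketched with the standard library. The export_artifact helper and the file names below are hypothetical; the point is only that copying leaves the cached file available for later calls.

```python
import shutil
import tempfile
from pathlib import Path


def export_artifact(cached: Path, export_dir: Path) -> Path:
    """Copy a cached artifact into export_dir and return the new path."""
    export_dir.mkdir(parents=True, exist_ok=True)
    # copy2 (rather than move) leaves the cached file in place for later reuse.
    return Path(shutil.copy2(cached, export_dir / cached.name))


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "cache" / "output" / "example.sofa"
    cached.parent.mkdir(parents=True)
    cached.write_bytes(b"placeholder")
    exported = export_artifact(cached, Path(tmp) / "export")
    cache_intact = cached.exists()
    exported_ok = exported.read_bytes() == b"placeholder"
```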