Architecture and Processing Flow
================================

This page describes the contributor-facing architecture of ``irdl`` and the
conceptual flow of a ``Dataset.get(...)`` call. We focus on the extension
points rather than every private implementation detail.

Core architecture
-----------------

``BaseDataset``
   The core abstraction for all Datasets. See :class:`~irdl.base.BaseDataset`.
   It defines the shared ``get`` pipeline: common parameter validation, cache
   path handling, retrieval/process orchestration, SOFA verification, and
   output conversion.

Optional Dataset Family classes
   When multiple Datasets share dataset-specific steps in the get pipeline
   (see :ref:`get-processing-flow`), a Dataset Family class can be introduced
   to lift that shared logic into an intermediate base class. Do this only
   when the behaviour is genuinely shared; otherwise keep it in the individual
   Dataset class.

Individual Dataset classes
   Each Dataset is implemented as an individual class. Its typed ``get()``
   classmethod delegates to the shared pipeline, while the class itself
   implements that pipeline's dataset-specific steps (see
   :ref:`get-processing-flow`). The typed signature and NumPy-style docstring
   are also used to generate CLI parameters and help text automatically.

Support modules
   Retrieval/repository helpers, CLI generation, logging/progress helpers, and
   small utility functions live outside the Dataset classes.

Class hierarchy:

.. code-block:: text

   BaseDataset
   └── Optional Dataset Family class
       └── Individual Dataset class

Cache stages
------------

``irdl`` uses three canonical Cache Stages:

1. ``provider``

   Files exactly as the Dataset Provider delivers them.

2. ``ingest``

   The single ingest-ready file that ``irdl`` can read into the internal SOFA
   representation.

3. ``output``

   Files produced by converting the internal SOFA representation to
   disk-based Output Formats.

Use these names in code comments and documentation.
"Ingest-ready" is an adjective for a file in the ``ingest`` stage, not a separate stage name. .. _get-processing-flow: ``get`` processing flow ----------------------- A public ``Dataset.get(...)`` call delegates to the shared :class:`~irdl.base.BaseDataset` flow: .. code-block:: text Dataset.get(...) └─ BaseDataset._get(...) ├─ validate common and Dataset-specific parameters ├─ resolve provider / ingest / output paths ├─ [optional] raw output: retrieve provider artifact and return it ├─ reuse cached output if available ├─ reuse ingest file if available ├─ retrieve provider artifact if needed ├─ [optional] process provider file(s) into ingest file ├─ ingest to internal SOFA representation ├─ verify and upgrade SOFA convention └─ convert SOFA → requested Output Format Each Dataset has the following extension points: ``_validate_params()`` Mandatory. Validate Dataset-specific parameters and invalid parameter combinations. ``_source_filename()`` Mandatory. Return the canonical basename for the ingest-ready file. ``_download()`` Mandatory. Acquire provider-stage file(s) and return the primary provider artifact. ``_process()`` Optional. Transform provider-stage files into the single ingest-ready file. The default implementation skips processing and handles simple single-file promotion from ``provider`` to ``ingest``. ``_ingest()`` Mandatory. Read the ingest-ready file and return the internal SOFA representation. ``_to_output()`` Optional. Conversion methods normally stay in :class:`~irdl.base.BaseDataset`. New Datasets should not implement output-specific conversion unless the shared conversion layer itself needs to change. Output behavior --------------- ``output_format="raw"`` returns the provider-stage artifact before ``irdl`` processing. For all other Output Formats, ``irdl`` ingests the data to SOFA first and then converts from SOFA to the requested representation. See the public class docs in :doc:`/reference/python_api`. 
When ``export_dir`` is provided, ``irdl`` copies the requested artifact to the Export Directory. The cache remains intact so later calls can reuse provider, ingest, or output artifacts.
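
The copy-and-keep behavior described above can be illustrated with a short
sketch. The helper name ``export_artifact`` and the file layout are
assumptions made for this example; they are not taken from ``irdl``.

.. code-block:: python

   import shutil
   import tempfile
   from pathlib import Path


   def export_artifact(cached_path, export_dir):
       """Copy a cached artifact into export_dir, leaving the cache untouched."""
       export_dir = Path(export_dir)
       export_dir.mkdir(parents=True, exist_ok=True)
       target = export_dir / Path(cached_path).name
       shutil.copy2(cached_path, target)  # copy, not move
       return target


   with tempfile.TemporaryDirectory() as tmp:
       # Hypothetical cached output artifact.
       cached = Path(tmp) / "cache" / "output" / "toy.sofa"
       cached.parent.mkdir(parents=True)
       cached.write_bytes(b"placeholder")

       exported = export_artifact(cached, Path(tmp) / "export")
       cache_still_there = cached.exists()
       exported_matches = exported.read_bytes() == cached.read_bytes()

Copying rather than moving is what lets a later call find the same artifact
in the cache again.
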