.. _adding_new_dataset_heading:

Adding a new Dataset
====================

A new Dataset should fit into the shared :class:`~irdl.base.BaseDataset` ``get()`` flow rather than implementing its own retrieval pipeline. The Dataset-specific code covers three responsibilities: acquiring provider data, preparing a single file for the ``ingest`` stage, and reading that file into the internal SOFA representation. See :ref:`get-processing-flow` for the full list of extension points.

Choose a base class
-------------------

Always inherit from :class:`~irdl.base.BaseDataset` unless a more specific base class applies. Introduce a new Dataset Family class only when at least two Datasets share dataset-specific steps in the ``get`` pipeline.

If the provider data is already SOFA-native, consider inheriting from :class:`~irdl.sofa.SofaBaseDataset`. :class:`~irdl.sofa.SofaBaseDataset` preserves the same shared flow but avoids unnecessary SOFA output rewrites when ``output_format="sofa"`` is requested.

Implement the Dataset class
---------------------------

A skeletal Dataset looks like this:

.. code-block:: python

    from pathlib import Path

    import sofar as sf

    from irdl.base import BaseDataset


    class NewDataset(BaseDataset):
        """Retrieve and process the NEW Dataset."""

        name = "newDataset"
        doi = "10.xxxx/example"

        @classmethod
        def get(
            cls,
            cache_dir: str | None = None,
            export_dir: str | None = None,
            output_format: str = "pyfar",
            *,
            scenario: str = "default",
        ):
            """
            Parameters
            ----------
            scenario : str
                Dataset-specific scenario to retrieve.
            """
            return cls()._get(
                cache_dir=cache_dir,
                export_dir=export_dir,
                output_format=output_format,
                scenario=scenario,
            )

        def _validate_params(self, **dataset_kwargs) -> None:
            scenario = dataset_kwargs["scenario"]
            if scenario not in {"default"}:
                raise ValueError("scenario must be 'default'")

        def _source_filename(self, **dataset_kwargs) -> str:
            scenario = dataset_kwargs["scenario"]
            return f"new-{scenario}.sofa"

        def _download(self, provider_dir: Path, **dataset_kwargs) -> Path:
            # Retrieve provider file(s) into provider_dir.
            # Return the primary provider artifact.
            # Note: The public download() method is a wrapper that calls this _download() method.
            raise NotImplementedError

        def _process(self, provider_artifact: Path, ingest_path: Path, **dataset_kwargs) -> Path:
            # Optional: extract, merge, rename, or convert provider data into ingest_path.
            # If the provider artifact is already a single ingest-ready file, the
            # BaseDataset implementation may be enough and this override can be removed.
            # Note: The public process() method is a wrapper that calls this _process() method.
            raise NotImplementedError

        def _ingest(self, ingest_path: Path) -> sf.Sofa:
            # Read ingest_path and return a sofar.Sofa object.
            raise NotImplementedError

Keep this template intentionally small. Do not copy processing logic from another Dataset unless the new Dataset has the same provider format and needs the same transformation.

Update public API and CLI
-------------------------

After implementing the class, add it to ``src/irdl/__init__.py``. This exposes the Dataset as part of the public API:

.. code-block:: python

    from .new_module import NewDataset as NewDataset

The CLI is generated automatically from concrete :class:`~irdl.base.BaseDataset` subclasses imported by ``irdl``. Do not add hand-written CLI code for a new Dataset. The CLI command name comes from ``Dataset.name``; parameters and help text come from the typed ``get()`` signature and NumPy-style docstring.
See :doc:`/reference/python_api` for API links.

Document the Dataset
--------------------

Dataset documentation is auto-generated by running ``make generated-docs`` in the docs directory, which executes ``docs/generate_dataset_docs.py``. This script:

- Discovers all Dataset implementations from the ``irdl`` module
- Groups them by their ``_category`` attribute (from :class:`~irdl.base.DatasetCategory`)
- Generates category pages and individual dataset pages

To categorize your Dataset, set the ``_category`` class attribute to one of the :class:`~irdl.base.DatasetCategory` values:

.. code-block:: python

    from irdl.base import BaseDataset, DatasetCategory


    class NewDataset(BaseDataset):
        # Alternatively: DatasetCategory.HEAD_RELATED_IMPULSE_RESPONSES
        _category = DatasetCategory.ROOM_IMPULSE_RESPONSES

If no handwritten prose is needed, the auto-generated page contains only the class documentation. To add custom documentation, create an ``.rst`` fragment in ``docs/datasets/`` named after the Dataset ``name`` attribute (for example, ``name = "new"`` maps to ``docs/datasets/new.rst``). The generated Dataset page lives under ``docs/_generated/datasets/`` and automatically includes that fragment. See existing dataset files for examples.

The documentation Makefile regenerates the dataset docs and CLI help while building the docs. Maintainers can refresh generated docs during review by running:

.. code-block:: console

    $ uv run make -C docs html

Manual verification and Evidence
--------------------------------

Final verification should use the public ``get()`` path, not a direct private method call. For example:

.. code-block:: python

    from irdl import NewDataset

    path = NewDataset.get(..., output_format="sofa")
    print(path)

This exercises validation, provider acquisition, processing, ingest, SOFA verification, and output conversion. :class:`~irdl.base.BaseDataset` verifies the SOFA convention and prints diagnostics that are useful while implementing a Dataset, including during agent-assisted coding.

In your contribution notes, include the command or Python snippet you ran and the smallest useful evidence that the retrieved data is correct, such as expected filenames, dimensions, sampling rate, coordinates, or Dataset metadata.
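
As one possible way to gather file-level evidence for the contribution notes, the sketch below summarizes a retrieved file using only the standard library. The helper name ``summarize_file`` is hypothetical and not part of ``irdl``, and the SHA-256 checksum is an assumption about what reviewers may find useful. SOFA-specific details such as dimensions or sampling rate would need a SOFA reader and are not covered here.

```python
import hashlib
from pathlib import Path


def summarize_file(path) -> dict:
    """Return the filename, size in bytes, and SHA-256 digest of a file.

    Hypothetical helper for contribution notes; not part of irdl.
    """
    file_path = Path(path)
    data = file_path.read_bytes()
    return {
        "name": file_path.name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Calling ``summarize_file(path)`` on the path returned by ``get()`` gives a small dictionary that can be pasted into the notes next to the command that produced it.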