Adding a new Dataset

A new Dataset should fit into the shared BaseDataset get() flow rather than implementing its own retrieval pipeline. The Dataset-specific code covers three responsibilities: acquiring provider data, preparing a single file for the ingest stage, and reading that file into the internal SOFA representation. See get processing flow for the full list of extension points.

Choose a base class

Inherit from BaseDataset unless a more specific base class applies. Introduce a new Dataset Family class only when at least two Datasets share Dataset-specific steps in the get() pipeline.

If the provider data is already SOFA-native, consider inheriting from SofaBaseDataset. SofaBaseDataset preserves the same shared flow but avoids unnecessary SOFA output rewrites when output_format="sofa" is requested.

Implement the Dataset class

A skeletal Dataset looks like this:

from pathlib import Path

import sofar as sf

from irdl.base import BaseDataset


class NewDataset(BaseDataset):
    """Retrieve and process the NEW Dataset."""

    name = "newDataset"
    doi = "10.xxxx/example"

    @classmethod
    def get(
        cls,
        cache_dir: str | None = None,
        export_dir: str | None = None,
        output_format: str = "pyfar",
        *,
        scenario: str = "default",
    ):
        """Retrieve the NEW Dataset.

        Parameters
        ----------
        scenario : str
            Dataset-specific scenario to retrieve.
        """
        return cls()._get(
            cache_dir=cache_dir,
            export_dir=export_dir,
            output_format=output_format,
            scenario=scenario,
        )

    def _validate_params(self, **dataset_kwargs) -> None:
        scenario = dataset_kwargs["scenario"]
        if scenario not in {"default"}:
            raise ValueError("scenario must be 'default'")

    def _source_filename(self, **dataset_kwargs) -> str:
        scenario = dataset_kwargs["scenario"]
        return f"new-{scenario}.sofa"

    def _download(self, provider_dir: Path, **dataset_kwargs) -> Path:
        # Retrieve provider file(s) into provider_dir.
        # Return the primary provider artifact.
        # Note: The public download() method is a wrapper that calls this _download() method.
        raise NotImplementedError

    def _process(self, provider_artifact: Path, ingest_path: Path, **dataset_kwargs) -> Path:
        # Optional: extract, merge, rename, or convert provider data into ingest_path.
        # If the provider artifact is already a single ingest-ready file, the
        # BaseDataset implementation may be enough and this override can be removed.
        # Note: The public process() method is a wrapper that calls this _process() method.
        raise NotImplementedError

    def _ingest(self, ingest_path: Path) -> sf.Sofa:
        # Read ingest_path and return a sofar.Sofa object.
        raise NotImplementedError

Keep this template intentionally small. Do not copy processing logic from another Dataset unless the new Dataset has the same provider format and needs the same transformation.
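
As an illustration of how the two smallest hooks might grow once a Dataset offers more than one scenario, the sketch below keeps the allowed values in one place and reuses them in the error message. It is written as standalone functions so it runs without irdl; the scenario name "reverberant" and the ALLOWED_SCENARIOS constant are hypothetical and not part of the library.

```python
# Standalone sketch: these functions mirror the bodies of _validate_params()
# and _source_filename(). The scenario names are hypothetical.
ALLOWED_SCENARIOS = ("default", "reverberant")


def validate_params(**dataset_kwargs) -> None:
    """Reject unknown scenarios before any provider data is requested."""
    scenario = dataset_kwargs["scenario"]
    if scenario not in ALLOWED_SCENARIOS:
        allowed = ", ".join(repr(s) for s in ALLOWED_SCENARIOS)
        raise ValueError(f"scenario must be one of {allowed}; got {scenario!r}")


def source_filename(**dataset_kwargs) -> str:
    """Derive the source filename from the already validated parameters."""
    return f"new-{dataset_kwargs['scenario']}.sofa"
```

Listing the accepted values in the exception keeps the message accurate when a scenario is added later.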

Update public API and CLI

After implementing the class, add it to src/irdl/__init__.py. This exposes the Dataset as part of the public API:

from .new_module import NewDataset as NewDataset

The CLI is generated automatically from concrete BaseDataset subclasses imported by irdl. Do not add hand-written CLI code for a new Dataset. The CLI command name comes from Dataset.name; parameters and help text come from the typed get() signature and NumPy-style docstring. See Python API for API links.
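
To see why a precise get() signature matters, here is a minimal sketch of how a command could be derived from it with the standard library. This is not irdl's actual CLI generator; DemoDataset and build_parser are hypothetical stand-ins, and a real generator would presumably also read the type annotations and the docstring for help text.

```python
import argparse
import inspect
from typing import Optional


class DemoDataset:
    """Stand-in for a Dataset class; not part of irdl."""

    name = "demo"

    @classmethod
    def get(
        cls,
        cache_dir: Optional[str] = None,
        output_format: str = "pyfar",
        *,
        scenario: str = "default",
    ):
        """Return the received arguments so the sketch can be checked."""
        return {
            "cache_dir": cache_dir,
            "output_format": output_format,
            "scenario": scenario,
        }


def build_parser(dataset_cls) -> argparse.ArgumentParser:
    """Create one option per get() parameter, reusing its default value."""
    parser = argparse.ArgumentParser(prog=dataset_cls.name)
    # Accessing get through the class binds cls, so cls is not listed here.
    for param in inspect.signature(dataset_cls.get).parameters.values():
        flag = "--" + param.name.replace("_", "-")
        parser.add_argument(flag, dest=param.name, default=param.default)
    return parser


args = build_parser(DemoDataset).parse_args(["--scenario", "default"])
result = DemoDataset.get(**vars(args))
```

In this sketch every parameter of get() becomes an option automatically, which is the reason no hand-written CLI code is needed for a new Dataset.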

Document the Dataset

Dataset documentation is auto-generated by running make generated-docs in the docs directory, which executes docs/generate_dataset_docs.py. This script:

- Discovers all Dataset implementations from the irdl module
- Groups them by their _category attribute (from DatasetCategory)
- Generates category pages and individual dataset pages
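
The discovery and grouping steps can be pictured with a short sketch. It does not reproduce docs/generate_dataset_docs.py; Base and Category are hypothetical stand-ins for BaseDataset and DatasetCategory, the enum values are invented, and walking __subclasses__() is only one plausible way to find the classes.

```python
from collections import defaultdict
from enum import Enum


class Category(Enum):
    """Hypothetical stand-in for DatasetCategory."""

    ROOM_IMPULSE_RESPONSES = "room"
    HEAD_RELATED_IMPULSE_RESPONSES = "hrir"


class Base:
    """Hypothetical stand-in for BaseDataset."""

    name = ""
    _category = None


class RoomA(Base):
    name = "roomA"
    _category = Category.ROOM_IMPULSE_RESPONSES


class HeadB(Base):
    name = "headB"
    _category = Category.HEAD_RELATED_IMPULSE_RESPONSES


def iter_subclasses(cls):
    """Yield every direct and indirect subclass of cls."""
    for sub in cls.__subclasses__():
        yield sub
        yield from iter_subclasses(sub)


def group_by_category(base):
    """Map each category to the sorted names of its Datasets."""
    groups = defaultdict(list)
    for dataset_cls in iter_subclasses(base):
        groups[dataset_cls._category].append(dataset_cls.name)
    return {category: sorted(names) for category, names in groups.items()}


pages = group_by_category(Base)
```

A Dataset that is never imported would not show up in such a walk, which matches the requirement to export the class from src/irdl/__init__.py.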

To categorize your Dataset, set the _category class attribute to one of the DatasetCategory values:

from irdl.base import BaseDataset, DatasetCategory

class NewDataset(BaseDataset):
    _category = DatasetCategory.ROOM_IMPULSE_RESPONSES  # or HEAD_RELATED_IMPULSE_RESPONSES

If no handwritten prose is needed, the auto-generated page will contain just the class documentation. To add custom documentation, create an .rst fragment in docs/datasets/ named after the Dataset name attribute (for example, name = "new" maps to docs/datasets/new.rst). The generated Dataset page lives under docs/_generated/datasets/ and automatically includes that fragment. See existing dataset files for examples.

The documentation Makefile regenerates the dataset docs and CLI help while building the docs. Maintainers can refresh generated docs during review by running:

$ uv run make -C docs html

Manual verification and evidence

Final verification should use the public get() path, not a direct private method call. For example:

from irdl import NewDataset

path = NewDataset.get(..., output_format="sofa")
print(path)

This exercises validation, provider acquisition, processing, ingest, SOFA verification, and output conversion. BaseDataset verifies the SOFA convention and prints diagnostics that are useful while implementing a Dataset, including during agent-assisted coding.
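
The stages listed above can be pictured as a template method. The sketch below is a simplified stand-in, not BaseDataset itself: the class names, the _verify and _convert helpers, and the reduced hook signatures are assumptions made only to show the call order.

```python
class SketchBase:
    """Hypothetical stand-in that mimics the order of the shared flow."""

    def __init__(self):
        self.calls = []

    def _get(self, output_format, **dataset_kwargs):
        # Dataset-specific hooks run first, shared stages afterwards.
        self._validate_params(**dataset_kwargs)
        artifact = self._download(**dataset_kwargs)
        ingest_path = self._process(artifact, **dataset_kwargs)
        data = self._ingest(ingest_path)
        self._verify(data)
        return self._convert(data, output_format)

    def _verify(self, data):
        self.calls.append("verify")

    def _convert(self, data, output_format):
        self.calls.append("convert")
        return (output_format, data)


class SketchDataset(SketchBase):
    """Dataset-specific hooks; each one records that it ran."""

    def _validate_params(self, **dataset_kwargs):
        self.calls.append("validate")

    def _download(self, **dataset_kwargs):
        self.calls.append("download")
        return "provider.zip"

    def _process(self, artifact, **dataset_kwargs):
        self.calls.append("process")
        return "ingest.sofa"

    def _ingest(self, ingest_path):
        self.calls.append("ingest")
        return {"source": ingest_path}


dataset = SketchDataset()
result = dataset._get("sofa", scenario="default")
```

Calling a single hook such as _ingest() directly would skip validation, verification, and conversion, which is why the final check should go through get().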

In your contribution notes, include the command or Python snippet you ran and the smallest useful evidence that the retrieved data is correct, such as expected filenames, dimensions, sampling rate, coordinates, or Dataset metadata.
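
One way to gather such evidence is a small helper that reads a few attributes and reduces arrays to their shape. The sketch below is only an illustration: collect_evidence is a hypothetical helper, the attribute names follow the naming sofar uses for SOFA variables and are assumed to exist for the Dataset's convention, and the stand-in object with made-up values replaces a real file so the example runs without sofar.

```python
from types import SimpleNamespace


def collect_evidence(sofa, value_fields, shape_fields):
    """Return selected attribute values plus the shapes of array attributes."""
    summary = {name: getattr(sofa, name) for name in value_fields}
    for name in shape_fields:
        summary[name + ".shape"] = tuple(getattr(sofa, name).shape)
    return summary


# Stand-in with invented values; a real check would inspect the retrieved data.
stand_in = SimpleNamespace(
    GLOBAL_SOFAConventions="SimpleFreeFieldHRIR",
    Data_SamplingRate=48000.0,
    Data_IR=SimpleNamespace(shape=(440, 2, 256)),
)

summary = collect_evidence(
    stand_in,
    value_fields=["GLOBAL_SOFAConventions", "Data_SamplingRate"],
    shape_fields=["Data_IR"],
)
```

With a real Dataset, the object passed in could be the result of sofar.read_sofa() applied to the path returned by get(); the printed summary is then short enough to paste into the contribution notes.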