Adding a new Dataset#
A new Dataset should fit into the shared BaseDataset get() flow
rather than implementing its own retrieval pipeline. The Dataset-specific code covers three
responsibilities: acquiring provider data, preparing a single file for the ingest stage,
and reading that file into the internal SOFA representation. See
get processing flow for the full list of extension points.
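The division of labour described above can be sketched as a template method. This is only an illustration: SketchBaseDataset and ToyDataset are hypothetical stand-ins, the order of the hooks and the copy-only default for _process are assumptions, and the real BaseDataset adds steps such as SOFA verification and output conversion.

```python
import tempfile
from pathlib import Path


class SketchBaseDataset:
    """Hypothetical stand-in for BaseDataset; not irdl's actual code."""

    def _get(self, cache_dir, **dataset_kwargs):
        # 1. Reject invalid parameters before touching the disk.
        self._validate_params(**dataset_kwargs)
        provider_dir = Path(cache_dir) / "provider"
        provider_dir.mkdir(parents=True, exist_ok=True)
        ingest_path = Path(cache_dir) / self._source_filename(**dataset_kwargs)
        # 2. Acquire the provider data.
        artifact = self._download(provider_dir, **dataset_kwargs)
        # 3. Prepare a single file for the ingest stage.
        ingest_file = self._process(artifact, ingest_path, **dataset_kwargs)
        # 4. Read that file into the internal representation.
        return self._ingest(ingest_file)

    def _validate_params(self, **dataset_kwargs):
        pass

    def _source_filename(self, **dataset_kwargs):
        raise NotImplementedError

    def _download(self, provider_dir, **dataset_kwargs):
        raise NotImplementedError

    def _process(self, provider_artifact, ingest_path, **dataset_kwargs):
        # Assumed default: the artifact is already ingest-ready, so copy it.
        ingest_path.write_bytes(provider_artifact.read_bytes())
        return ingest_path

    def _ingest(self, ingest_path):
        raise NotImplementedError


class ToyDataset(SketchBaseDataset):
    """Only the Dataset-specific hooks are implemented here."""

    def _source_filename(self, **dataset_kwargs):
        return f"toy-{dataset_kwargs['scenario']}.txt"

    def _download(self, provider_dir, **dataset_kwargs):
        artifact = provider_dir / "raw.txt"
        artifact.write_text("toy data")
        return artifact

    def _ingest(self, ingest_path):
        return ingest_path.read_text()


with tempfile.TemporaryDirectory() as tmp:
    result = ToyDataset()._get(tmp, scenario="default")
```

The subclass never orchestrates the pipeline itself; it only fills in the hooks that the shared flow calls.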
Choose a base class#
Always inherit from BaseDataset unless a more specific base class applies.
Introduce a new Dataset Family class only when at least two Datasets share dataset-specific steps
in the get pipeline.
If the provider data is already SOFA-native, consider inheriting from
SofaBaseDataset. SofaBaseDataset preserves the same shared flow but
avoids unnecessary SOFA output rewrites when output_format="sofa" is requested.
Implement the Dataset class#
A skeletal Dataset looks like this:
from pathlib import Path

import sofar as sf

from irdl.base import BaseDataset


class NewDataset(BaseDataset):
    """Retrieve and process the NEW Dataset."""

    name = "newDataset"
    doi = "10.xxxx/example"

    @classmethod
    def get(
        cls,
        cache_dir: str | None = None,
        export_dir: str | None = None,
        output_format: str = "pyfar",
        *,
        scenario: str = "default",
    ):
        """
        Parameters
        ----------
        scenario : str
            Dataset-specific scenario to retrieve.
        """
        return cls()._get(
            cache_dir=cache_dir,
            export_dir=export_dir,
            output_format=output_format,
            scenario=scenario,
        )

    def _validate_params(self, **dataset_kwargs) -> None:
        scenario = dataset_kwargs["scenario"]
        if scenario not in {"default"}:
            raise ValueError("scenario must be 'default'")

    def _source_filename(self, **dataset_kwargs) -> str:
        scenario = dataset_kwargs["scenario"]
        return f"new-{scenario}.sofa"

    def _download(self, provider_dir: Path, **dataset_kwargs) -> Path:
        # Retrieve provider file(s) into provider_dir.
        # Return the primary provider artifact.
        # Note: The public download() method is a wrapper that calls this _download() method.
        raise NotImplementedError

    def _process(self, provider_artifact: Path, ingest_path: Path, **dataset_kwargs) -> Path:
        # Optional: extract, merge, rename, or convert provider data into ingest_path.
        # If the provider artifact is already a single ingest-ready file, the
        # BaseDataset implementation may be enough and this override can be removed.
        # Note: The public process() method is a wrapper that calls this _process() method.
        raise NotImplementedError

    def _ingest(self, ingest_path: Path) -> sf.Sofa:
        # Read ingest_path and return a sofar.Sofa object.
        raise NotImplementedError
Keep this template intentionally small. Do not copy processing logic from another Dataset unless the new Dataset has the same provider format and needs the same transformation.
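As an illustration of what a _process override might do, the sketch below assumes a provider that ships a zip archive containing exactly one .sofa file. The helper name extract_single_sofa and the archive layout are hypothetical; a real Dataset would call similar logic from its own _process method.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_single_sofa(provider_artifact: Path, ingest_path: Path) -> Path:
    """Copy the only .sofa member of a zip archive to ingest_path."""
    with zipfile.ZipFile(provider_artifact) as archive:
        members = [n for n in archive.namelist() if n.lower().endswith(".sofa")]
        if len(members) != 1:
            raise ValueError(f"expected exactly one .sofa file, found {len(members)}")
        ingest_path.parent.mkdir(parents=True, exist_ok=True)
        # Stream the member to its final name instead of extracting the whole tree.
        with archive.open(members[0]) as src, open(ingest_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return ingest_path


# Demonstration with a throwaway archive; the bytes are placeholders, not SOFA data.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    archive_path = tmp_dir / "provider.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("data/measurement.sofa", b"placeholder bytes")
        archive.writestr("README.txt", "not ingested")
    out = extract_single_sofa(archive_path, tmp_dir / "ingest" / "new-default.sofa")
    extracted = out.read_bytes()
    out_name = out.name
```

Failing loudly when the archive does not contain exactly one candidate keeps ambiguous provider data from reaching the ingest stage.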
Update public API and CLI#
After implementing the class, add it to src/irdl/__init__.py. This exposes the Dataset
as part of the public API:
from .new_module import NewDataset as NewDataset
The CLI is generated automatically from concrete BaseDataset
subclasses imported by irdl. Do not add hand-written CLI code for a new Dataset. The CLI
command name comes from Dataset.name; parameters and help text come from the typed
get() signature and NumPy-style docstring. See Python API for API
links.
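To see why a typed get() signature is enough to drive a CLI, consider the following sketch. It is not irdl's generator: ToyDataset and build_parser are hypothetical, and the sketch derives only option names and defaults with inspect and argparse, leaving out help text from the docstring.

```python
import argparse
import inspect
from typing import Optional


class ToyDataset:
    """Hypothetical stand-in for a concrete Dataset class."""

    name = "toy"

    @classmethod
    def get(
        cls,
        cache_dir: Optional[str] = None,
        output_format: str = "pyfar",
        *,
        scenario: str = "default",
    ):
        return {
            "cache_dir": cache_dir,
            "output_format": output_format,
            "scenario": scenario,
        }


def build_parser(dataset_cls) -> argparse.ArgumentParser:
    """Create one --option per get() parameter, reusing its default."""
    parser = argparse.ArgumentParser(prog=dataset_cls.name)
    # Inspecting the bound classmethod omits the implicit cls parameter.
    signature = inspect.signature(dataset_cls.get)
    for param_name, param in signature.parameters.items():
        parser.add_argument(
            "--" + param_name.replace("_", "-"),
            dest=param_name,
            default=param.default,
        )
    return parser


parser = build_parser(ToyDataset)
args = parser.parse_args(["--output-format", "sofa"])
result = ToyDataset.get(**vars(args))
```

Because the options are derived from the signature, renaming or adding a get() parameter changes the command line without any separate CLI code.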
Document the Dataset#
Dataset documentation is auto-generated by running make generated-docs in the
docs directory, which executes docs/generate_dataset_docs.py. This script:
- Discovers all Dataset implementations from the irdl module
- Groups them by their _category attribute (from DatasetCategory)
- Generates category pages and individual dataset pages
To categorize your Dataset, set the _category class attribute to one of the
DatasetCategory values:
from irdl.base import BaseDataset, DatasetCategory


class NewDataset(BaseDataset):
    _category = DatasetCategory.ROOM_IMPULSE_RESPONSES  # or HEAD_RELATED_IMPULSE_RESPONSES
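The discovery and grouping steps of the generator script could look roughly like the sketch below. Everything here is a stand-in: SketchCategory mirrors the two DatasetCategory member names mentioned above with assumed values, and discovery via __subclasses__() is an assumption about how docs/generate_dataset_docs.py finds implementations.

```python
from collections import defaultdict
from enum import Enum


class SketchCategory(Enum):
    """Hypothetical mirror of DatasetCategory; the values are assumed."""

    ROOM_IMPULSE_RESPONSES = "room_impulse_responses"
    HEAD_RELATED_IMPULSE_RESPONSES = "head_related_impulse_responses"


class SketchBase:
    """Hypothetical stand-in for BaseDataset."""

    name = ""
    _category = None


class RoomA(SketchBase):
    name = "room-a"
    _category = SketchCategory.ROOM_IMPULSE_RESPONSES


class HeadB(SketchBase):
    name = "head-b"
    _category = SketchCategory.HEAD_RELATED_IMPULSE_RESPONSES


class RoomC(SketchBase):
    name = "room-c"
    _category = SketchCategory.ROOM_IMPULSE_RESPONSES


def group_by_category(base) -> dict:
    """Map each category to the sorted names of the Datasets that declare it."""
    groups = defaultdict(list)
    for dataset_cls in base.__subclasses__():
        groups[dataset_cls._category].append(dataset_cls.name)
    return {category: sorted(names) for category, names in groups.items()}


groups = group_by_category(SketchBase)
```

Each key of the resulting mapping would correspond to one category page, and each name to one individual dataset page.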
If no handwritten prose is needed, the auto-generated page will contain just the
class documentation. To add custom documentation, create an .rst fragment in
docs/datasets/ named after the Dataset name attribute (for example,
name = "new" maps to docs/datasets/new.rst). The generated Dataset page
lives under docs/_generated/datasets/ and automatically includes that fragment.
See existing dataset files for examples.
The documentation Makefile regenerates the dataset docs and CLI help while building the docs. Maintainers can refresh generated docs during review by running:
$ uv run make -C docs html
Manual verification and evidence#
Final verification should use the public get() path, not a direct private method call.
For example:
from irdl import NewDataset
path = NewDataset.get(..., output_format="sofa")
print(path)
This exercises validation, provider acquisition, processing, ingest, SOFA verification, and
output conversion. BaseDataset verifies the SOFA convention and prints diagnostics that
are useful while implementing a Dataset, including during agent-assisted coding.
In your contribution notes, include the command or Python snippet you ran and the smallest useful evidence that the retrieved data is correct, such as expected filenames, dimensions, sampling rate, coordinates, or Dataset metadata.
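One way to collect such evidence is a small summary helper. In this sketch summarize_sofa is a hypothetical name, and the attribute names follow sofar's naming for SOFA variables (GLOBAL_SOFAConventions, Data_IR, Data_SamplingRate). A stand-in object with made-up values is used so the snippet runs without any dataset.

```python
from types import SimpleNamespace


def summarize_sofa(sofa) -> dict:
    """Collect a few facts worth pasting into contribution notes."""
    return {
        "convention": getattr(sofa, "GLOBAL_SOFAConventions", None),
        # Data_IR is array-like in sofar; only its shape is reported here.
        "ir_shape": getattr(getattr(sofa, "Data_IR", None), "shape", None),
        "sampling_rate": getattr(sofa, "Data_SamplingRate", None),
    }


# Stand-in for a sofar.Sofa object; the values are illustrative, not real data.
fake_sofa = SimpleNamespace(
    GLOBAL_SOFAConventions="SingleRoomSRIR",
    Data_IR=SimpleNamespace(shape=(1, 2, 4800)),
    Data_SamplingRate=48000.0,
)
summary = summarize_sofa(fake_sofa)
print(summary)
```

With a real file, the same helper could be applied to sofar.read_sofa(path), assuming get() returned a path to a SOFA file as in the snippet above.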