.. _adding_new_dataset_heading:

Adding a new Dataset
====================

A new Dataset should fit into the shared :class:`~irdl.base.BaseDataset` ``get()`` flow
rather than implementing its own retrieval pipeline. The Dataset-specific code covers three
responsibilities: acquiring provider data, preparing a single file for the ``ingest`` stage,
and reading that file into the internal SOFA representation. See
:ref:`get-processing-flow` for the full list of extension points.
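
To make this division of labor concrete, the following is a minimal, self-contained sketch
of a template-method flow like the one described above. It is not the ``irdl``
implementation: ``SketchBaseDataset``, ``ToyDataset``, and the cache layout are hypothetical
stand-ins, and the real :class:`~irdl.base.BaseDataset` additionally performs SOFA
verification and output conversion.

.. code-block:: python

   import tempfile
   from pathlib import Path


   class SketchBaseDataset:
       """Hypothetical stand-in for a shared base class (not irdl's code)."""

       def _get(self, cache_dir: str, **dataset_kwargs):
           # The base class fixes the order of the stages.
           self._validate_params(**dataset_kwargs)
           provider_dir = Path(cache_dir) / "provider"
           provider_dir.mkdir(parents=True, exist_ok=True)
           artifact = self._download(provider_dir, **dataset_kwargs)
           ingest_path = Path(cache_dir) / "ingest" / artifact.name
           ingest_path.parent.mkdir(parents=True, exist_ok=True)
           ingest_path = self._process(artifact, ingest_path, **dataset_kwargs)
           return self._ingest(ingest_path)

       def _process(self, provider_artifact, ingest_path, **dataset_kwargs):
           # Default: treat the provider artifact as already ingest-ready.
           ingest_path.write_bytes(provider_artifact.read_bytes())
           return ingest_path


   class ToyDataset(SketchBaseDataset):
       """The subclass only fills in the Dataset-specific hooks."""

       def _validate_params(self, **dataset_kwargs):
           if dataset_kwargs["scenario"] != "default":
               raise ValueError("scenario must be 'default'")

       def _download(self, provider_dir, **dataset_kwargs):
           # A real Dataset would fetch provider data; this writes a dummy file.
           artifact = provider_dir / f"toy-{dataset_kwargs['scenario']}.txt"
           artifact.write_text("1 2 3")
           return artifact

       def _ingest(self, ingest_path):
           # A real Dataset would return a sofar.Sofa object here.
           return [int(value) for value in ingest_path.read_text().split()]


   with tempfile.TemporaryDirectory() as tmp:
       result = ToyDataset()._get(tmp, scenario="default")

In this sketch the subclass never decides when a stage runs; it only supplies what each
stage does, which is the property the shared ``get()`` flow is meant to preserve.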

Choose a base class
-------------------

Inherit from :class:`~irdl.base.BaseDataset` unless a more specific base class applies.
Introduce a new Dataset Family class only when at least two Datasets share Dataset-specific steps
in the ``get`` pipeline.

If the provider data is already SOFA-native, consider inheriting from
:class:`~irdl.sofa.SofaBaseDataset`. It preserves the same shared flow but
avoids unnecessary SOFA output rewrites when ``output_format="sofa"`` is requested.

Implement the Dataset class
---------------------------

A skeletal Dataset looks like this:

.. code-block:: python

   from pathlib import Path

   import sofar as sf

   from irdl.base import BaseDataset


   class NewDataset(BaseDataset):
       """Retrieve and process the NEW Dataset."""

       name = "newDataset"
       doi = "10.xxxx/example"

       @classmethod
       def get(
           cls,
           cache_dir: str | None = None,
           export_dir: str | None = None,
           output_format: str = "pyfar",
           *,
           scenario: str = "default",
       ):
           """Retrieve the NEW Dataset.

           Parameters
           ----------
           scenario : str
               Dataset-specific scenario to retrieve.
           """
           return cls()._get(
               cache_dir=cache_dir,
               export_dir=export_dir,
               output_format=output_format,
               scenario=scenario,
           )

       def _validate_params(self, **dataset_kwargs) -> None:
           scenario = dataset_kwargs["scenario"]
           if scenario not in {"default"}:
               raise ValueError("scenario must be 'default'")

       def _source_filename(self, **dataset_kwargs) -> str:
           scenario = dataset_kwargs["scenario"]
           return f"new-{scenario}.sofa"

       def _download(self, provider_dir: Path, **dataset_kwargs) -> Path:
           # Retrieve provider file(s) into provider_dir.
           # Return the primary provider artifact.
           # Note: The public download() method is a wrapper that calls this _download() method.
           raise NotImplementedError

       def _process(self, provider_artifact: Path, ingest_path: Path, **dataset_kwargs) -> Path:
           # Optional: extract, merge, rename, or convert provider data into ingest_path.
           # If the provider artifact is already a single ingest-ready file, the
           # BaseDataset implementation may be enough and this override can be removed.
           # Note: The public process() method is a wrapper that calls this _process() method.
           raise NotImplementedError

       def _ingest(self, ingest_path: Path) -> sf.Sofa:
           # Read ingest_path and return a sofar.Sofa object.
           raise NotImplementedError

This template is intentionally small. Do not copy processing logic from another Dataset
unless the new Dataset has the same provider format and needs the same transformation.
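
As an illustration of the kind of work a ``_process`` override might do, the sketch below
copies one member of a ZIP archive to the ingest path using only the standard library. The
helper name ``extract_member`` and the archive layout are hypothetical; a real Dataset would
place equivalent logic inside its own ``_process`` method.

.. code-block:: python

   import shutil
   import zipfile
   from pathlib import Path


   def extract_member(provider_artifact: Path, member: str, ingest_path: Path) -> Path:
       """Copy a single archive member to ingest_path and return that path."""
       ingest_path.parent.mkdir(parents=True, exist_ok=True)
       with zipfile.ZipFile(provider_artifact) as archive:
           # Stream only the requested member to the target file instead of
           # unpacking the whole archive.
           with archive.open(member) as src, ingest_path.open("wb") as dst:
               shutil.copyfileobj(src, dst)
       return ingest_path

Writing directly to ``ingest_path`` keeps the result a single file, which matches what the
``ingest`` stage expects.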

Update public API and CLI
-------------------------

After implementing the class, add it to ``src/irdl/__init__.py``. This exposes the Dataset
as part of the public API:

.. code-block:: python

   from .new_module import NewDataset as NewDataset

The CLI is generated automatically from concrete :class:`~irdl.base.BaseDataset`
subclasses imported by ``irdl``. Do not add hand-written CLI code for a new Dataset. The CLI
command name comes from ``Dataset.name``; parameters and help text come from the typed
``get()`` signature and NumPy-style docstring. See :doc:`/reference/python_api` for API
links.
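
How ``irdl`` builds its CLI internally is not shown in this guide. The following sketch only
illustrates, under that caveat, why a typed signature with defaults is enough to derive
command-line options. The ``build_parser`` helper and the stand-in ``get`` function are
hypothetical.

.. code-block:: python

   import argparse
   import inspect
   from typing import Optional


   def get(
       cache_dir: Optional[str] = None,
       output_format: str = "pyfar",
       *,
       scenario: str = "default",
   ):
       """Hypothetical stand-in for a Dataset ``get()`` classmethod."""


   def build_parser(func, command_name: str) -> argparse.ArgumentParser:
       """Derive one ``--option`` per parameter of ``func``."""
       parser = argparse.ArgumentParser(prog=command_name)
       for name, param in inspect.signature(func).parameters.items():
           # The parameter name becomes the flag; its default carries over.
           parser.add_argument(f"--{name.replace('_', '-')}", default=param.default)
       return parser


   parser = build_parser(get, "newDataset")
   args = parser.parse_args(["--scenario", "default", "--output-format", "sofa"])

Because everything is read from the signature, renaming or retyping a ``get()`` parameter
changes the command line as well, so treat the signature as part of the public interface.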

Document the Dataset
--------------------

Dataset documentation is auto-generated by running ``make generated-docs`` in the
docs directory, which executes ``docs/generate_dataset_docs.py``. This script:

- Discovers all Dataset implementations from the ``irdl`` module
- Groups them by their ``_category`` attribute (from :class:`~irdl.base.DatasetCategory`)
- Generates category pages and individual dataset pages
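
The discovery and grouping steps can be pictured with the following sketch. It does not
reproduce ``docs/generate_dataset_docs.py``: the ``Category`` enum, its values, and the toy
classes are hypothetical stand-ins used only to show grouping by a class attribute.

.. code-block:: python

   from collections import defaultdict
   from enum import Enum


   class Category(Enum):
       """Hypothetical stand-in for irdl.base.DatasetCategory."""

       ROOM_IMPULSE_RESPONSES = "room"
       HEAD_RELATED_IMPULSE_RESPONSES = "head"


   class Base:
       """Hypothetical stand-in for the shared base class."""


   class RoomA(Base):
       name = "roomA"
       _category = Category.ROOM_IMPULSE_RESPONSES


   class HeadB(Base):
       name = "headB"
       _category = Category.HEAD_RELATED_IMPULSE_RESPONSES


   class RoomC(Base):
       name = "roomC"
       _category = Category.ROOM_IMPULSE_RESPONSES


   def group_by_category(base: type) -> dict:
       """Map each category to the sorted names of the direct subclasses using it."""
       groups = defaultdict(list)
       for cls in base.__subclasses__():
           groups[cls._category].append(cls.name)
       return {category: sorted(names) for category, names in groups.items()}


   groups = group_by_category(Base)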

To categorize your Dataset, set the ``_category`` class attribute to one of the
:class:`~irdl.base.DatasetCategory` values:

.. code-block:: python

   from irdl.base import BaseDataset, DatasetCategory

   class NewDataset(BaseDataset):
       _category = DatasetCategory.ROOM_IMPULSE_RESPONSES  # or HEAD_RELATED_IMPULSE_RESPONSES

If no handwritten prose is needed, the auto-generated page will contain just the
class documentation. To add custom documentation, create an ``.rst`` fragment in
``docs/datasets/`` named after the Dataset ``name`` attribute (for example,
``name = "newDataset"`` maps to ``docs/datasets/newDataset.rst``). The generated Dataset page
lives under ``docs/_generated/datasets/`` and automatically includes that fragment.
See existing dataset files for examples.

The documentation Makefile regenerates the dataset docs and CLI help while building the
docs. Maintainers can refresh generated docs during review by running:

.. code-block:: console

   $ uv run make -C docs html

Manual verification and evidence
--------------------------------

Final verification should use the public ``get()`` path, not a direct private method call.
For example:

.. code-block:: python

   from irdl import NewDataset

   path = NewDataset.get(..., output_format="sofa")
   print(path)

This exercises validation, provider acquisition, processing, ingest, SOFA verification, and
output conversion. :class:`~irdl.base.BaseDataset` verifies the SOFA convention and prints diagnostics that
are useful while implementing a Dataset, including during agent-assisted coding.

In your contribution notes, include the command or Python snippet you ran and the smallest
useful evidence that the retrieved data is correct, such as expected filenames, dimensions,
sampling rate, coordinates, or Dataset metadata.
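
One way to capture file-level evidence is sketched below using only the standard library.
The helper name ``summarize_output`` is hypothetical, and the checksum is an optional extra
rather than something this guide requires. Signal-level evidence such as dimensions or
sampling rate would still come from opening the file with a SOFA reader.

.. code-block:: python

   import hashlib
   from pathlib import Path


   def summarize_output(path) -> dict:
       """Return the filename, size, and SHA-256 digest of a retrieved file."""
       path = Path(path)
       data = path.read_bytes()
       return {
           "filename": path.name,
           "size_bytes": len(data),
           "sha256": hashlib.sha256(data).hexdigest(),
       }

The resulting dictionary is small enough to paste into the contribution notes next to the
``get()`` call that produced the file.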