Datamint Dataset

MLflow Dataset adapter for Datamint project splits.

class datamint.mlflow.data.datamint_dataset.DatamintDatasetSource(project_id, project_name, split, extra_params=None)

Bases: DatasetSource

Source info pointing to a Datamint project.

Parameters:
  • project_id (str)
  • project_name (str)
  • split (str | None)
  • extra_params (dict[str, Any] | None)

classmethod from_json(source_json)

Constructs an instance of the DatasetSource from a JSON string representation.

Parameters:

source_json (str) – A JSON string representation of the DatasetSource.

Return type:

DatamintDatasetSource

Returns:

A DatasetSource instance.

load(**kwargs)

Loads files / objects referred to by the DatasetSource. For example, depending on the type of DatasetSource, this may download source CSV files from S3 to the local filesystem, load a source Delta Table as a Spark DataFrame, etc.

Return type:

Any

Returns:

The downloaded source, e.g. a local filesystem path, a Spark DataFrame, etc.

Parameters:

kwargs (Any)

to_json()

Obtains a JSON string representation of the DatasetSource.

Return type:

str

Returns:

A JSON string representation of the DatasetSource.
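The to_json() and from_json() methods described above are meant to round-trip: a string produced by one should be accepted by the other. The sketch below illustrates that contract with a hypothetical stand-in class, ExampleSource, that mirrors the documented constructor fields. It is not the Datamint implementation, and the actual JSON layout used by DatamintDatasetSource may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ExampleSource:
    """Hypothetical stand-in with the same fields as the documented constructor.

    Illustrates the to_json / from_json contract only; the real class may
    serialize its fields differently.
    """

    project_id: str
    project_name: str
    split: Optional[str]
    extra_params: Optional[Dict[str, Any]] = None

    def to_json(self) -> str:
        # Serialize every constructor field into a single JSON string.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, source_json: str) -> "ExampleSource":
        # Rebuild an instance from a string produced by to_json().
        return cls(**json.loads(source_json))


# Example values are made up for illustration.
source = ExampleSource("proj-123", "demo-project", "train", {"version": 1})
restored = ExampleSource.from_json(source.to_json())
```

Under these assumptions, restored compares equal to source, which is the property a lineage system relies on when it stores the JSON string and later reconstructs the source.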

class datamint.mlflow.data.datamint_dataset.DatamintMLflowDataset(project_id, project_name, split, resources, extra_params=None)

Bases: Dataset

MLflow Dataset wrapping a Datamint project split for lineage tracking.

Parameters:
  • project_id (str)
  • project_name (str)
  • split (str | None)
  • resources (Sequence[str] | Sequence[Resource])
  • extra_params (dict[str, Any] | None)
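As a rough usage sketch, the constructor above can be combined with mlflow.log_input to record a project split as a run input. The helper name log_split_as_input is hypothetical, and the sketch assumes that mlflow and datamint are installed, that string entries in resources are resource identifiers, and that any credentials the constructor may need are already configured.

```python
from typing import Optional, Sequence


def log_split_as_input(
    project_id: str,
    project_name: str,
    split: Optional[str],
    resources: Sequence[str],
    context: str = "training",
) -> None:
    """Hypothetical helper: log a Datamint project split as an MLflow run input."""
    # Imported lazily so the sketch can be loaded without these packages installed.
    import mlflow
    from datamint.mlflow.data.datamint_dataset import DatamintMLflowDataset

    # Argument names follow the documented constructor signature.
    dataset = DatamintMLflowDataset(
        project_id=project_id,
        project_name=project_name,
        split=split,
        resources=resources,
    )

    # mlflow.log_input attaches the dataset to the active run for lineage tracking.
    with mlflow.start_run():
        mlflow.log_input(dataset, context=context)
```

The context string is free-form in MLflow; "training" is only a common choice.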

property profile: Any | None

Optional summary statistics for the dataset, such as the number of rows in a table, the mean / median / std of each table column, etc.

property schema

Optional dataset schema, such as an instance of mlflow.types.Schema representing the features and targets of the dataset.

to_dict()

Create config dictionary for the dataset.

Subclasses should override this method to provide additional fields in the config dict, e.g., schema, profile, etc.

Returns a dictionary with string keys and string values containing the following fields: name, digest, source, and source type.

Return type:

dict[str, str]
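The override pattern described for to_dict() can be sketched as follows. Both classes are hypothetical stand-ins rather than the Datamint or MLflow classes. The key name "source_type" and the value "datamint" are assumptions made for illustration, as is storing the source and profile as JSON strings so that every value stays a str.

```python
import json
from typing import Any, Dict, Optional


class ExampleDataset:
    """Hypothetical base class producing the four documented fields."""

    def __init__(self, name: str, digest: str, source_json: str, source_type: str) -> None:
        self.name = name
        self.digest = digest
        self.source_json = source_json
        self.source_type = source_type

    def to_dict(self) -> Dict[str, str]:
        # Every value is a string, matching the dict[str, str] return type.
        return {
            "name": self.name,
            "digest": self.digest,
            "source": self.source_json,
            "source_type": self.source_type,  # assumed key name
        }


class ProfiledExampleDataset(ExampleDataset):
    """Hypothetical subclass that adds a profile field to the config dict."""

    def __init__(self, *args: str, profile: Optional[Dict[str, Any]] = None) -> None:
        super().__init__(*args)
        self.profile = profile

    def to_dict(self) -> Dict[str, str]:
        # Start from the base fields, then add the extra one as a JSON string.
        config = super().to_dict()
        if self.profile is not None:
            config["profile"] = json.dumps(self.profile)
        return config


# Example values are made up for illustration.
config = ProfiledExampleDataset(
    "demo-project-train",
    "abc123",
    json.dumps({"project_id": "proj-123", "split": "train"}),
    "datamint",
    profile={"num_resources": 3},
).to_dict()
```

Calling super().to_dict() first keeps the base fields intact, so a subclass only has to describe what it adds.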