ML STAC Collection Specification 🤖

Overview 📜

This document explains the structure and content of the ML Collection extension. An ML-STAC Collection is simply a STAC Collection with additional fields (i.e. extensions). The STAC Collection Specification defines a set of common fields to describe a group of Items that share properties and metadata. The additional ML-STAC fields capture vital information about the tensor structure, the tasks, the providers, the data quality, etc. Most of these fields are estimated automatically from the ML-STAC Items, but others must be configured manually by the data provider. Note that the ML-STAC Catalog and ML-STAC Item do not comply with the STAC specification, in favour of a structure that is more intuitive and efficient in ML workflows. To create an ML-STAC Collection, you only need to comply with the REQUIRED fields of both the STAC Collection and the ML-STAC Collection, described in the sections below.

About the STAC fields 🛠️

These fields are obtained from the STAC Collection specification v1.0.0-rc.4. For further details on how to create a STAC Collection, please refer to the STAC Collection specification.

Element	Type	Description
type	string	REQUIRED. Must be set to `Collection` to be a valid Collection.
stac_version	string	REQUIRED. The STAC version the Collection implements.
id	string	REQUIRED. A unique identifier string.
description	string	REQUIRED. Detailed multi-line description to fully explain the Collection. CommonMark 0.29 syntax MAY be used for rich text representation.
license	string	REQUIRED. Collection's license(s), either an SPDX License identifier, `various` if multiple licenses apply or `proprietary` for all other cases.
extent	Extent Object	REQUIRED. Spatial and temporal extents.
links	Link Object	REQUIRED. A list of references to other documents.
keywords	[string]	OPTIONAL. List of keywords describing the Collection.
providers	Provider Object	OPTIONAL. A list of providers, which may include all organizations capturing or processing the data or the hosting provider. Providers should be listed in chronological order with the most recent provider being the last element of the list.
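
As an illustration of the table above, the following sketch assembles the REQUIRED STAC fields into a minimal Collection and checks that none is missing. Every value (identifier, description, license, extent, href) is a placeholder, and the `missing_required` helper is hypothetical rather than part of any ML-STAC or STAC library.

```python
# REQUIRED STAC Collection fields, as listed in the table above.
REQUIRED_STAC_FIELDS = (
    "type", "stac_version", "id", "description", "license", "extent", "links",
)

# All values below are placeholders chosen for illustration only.
collection = {
    "type": "Collection",
    "stac_version": "1.0.0-rc.4",
    "id": "example-dataset",
    "description": "A placeholder ML-STAC Collection.",
    "license": "CC-BY-4.0",
    "extent": {
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        "temporal": {"interval": [["2020-01-01T00:00:00Z", None]]},
    },
    "links": [{"rel": "self", "href": "https://example.com/collection.json"}],
    "keywords": ["example"],  # OPTIONAL
}


def missing_required(doc: dict) -> list:
    """Return the REQUIRED STAC fields that are absent from `doc` (hypothetical helper)."""
    return [name for name in REQUIRED_STAC_FIELDS if name not in doc]


print(missing_required(collection))
```

An empty list means that every REQUIRED STAC field is present; the check does not validate the content of each field.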

Additional Field Information 📄

stac_version 📜

In general, STAC versions can be mixed, but please keep the recommended best practices in mind.

stac_extensions 🧩

The ML STAC Collection depends on the following extensions:

File extension: Provides a way to specify file-related details such as checksum and size for the dataset.
Datacube extension: Provides a way to specify the tensor dimensions.
Classification extension: Provides a way to specify the classification scheme of the dataset. Useful for classification, object detection, semantic segmentation, instance segmentation, and panoptic segmentation tasks.
Projection extension: Provides a way to specify the projection of the dataset.

Extent Object 🌍

More information about this field can be found in the STAC Collection specification Extent Object section.

Link Object 🔗

An ML-STAC Collection does not need to define a relationship with another entity. Therefore, the rel field must be set to self. A detailed description of the Link Object can be found in the STAC Collection specification.

Provider Object 🏢

More information about this field can be found in the STAC Collection specification Provider Object section.

About the ML-STAC fields 🤖

ML-STAC Collections add extra fields to the standard STAC Collection. All the ML-STAC fields are prefixed with ml:. The following is a list of required and optional fields:

Key	Type	Description
ml:mlstac_version	string	REQUIRED. The ML-STAC version the Collection implements.
ml:train	MLSTAC-Catalog Object	REQUIRED. Represents the training dataset as an ML-STAC Catalog.
ml:validation	MLSTAC-Catalog Object	REQUIRED. Represents the validation dataset as an ML-STAC Catalog.
ml:test	MLSTAC-Catalog Object	REQUIRED. Represents the testing dataset as an ML-STAC Catalog.
ml:curator	Contact Object	REQUIRED. A list of contacts qualified by their role.
ml:task	Task Object	REQUIRED. The task is a string chosen from an explicit list of names.
ml:split_strategy	Split Object	OPTIONAL. The split strategy is a string chosen from an explicit list of method names.
ml:discuss	HyperLink Object	STRONGLY RECOMMENDED. Each dataset should have a link to report issues with respect to the curation of the dataset or to discuss the dataset in general. This link usually points to a GitHub Repository Issue Tracker.
ml:raw	HyperLink Object	STRONGLY RECOMMENDED. Each dataset should have a link to the original raw data.
ml:extra_metadata	HyperLink Object	OPTIONAL. Each ML dataset can have a link to additional metadata. The `extra_metadata` field must be a link to an Apache Parquet file. The number of rows in the Parquet file must equal the number of Items in the ML-STAC Collection. The Parquet file must contain a column named `id` with the Item IDs and a column named `split` with the split name of each Item. Beyond these two columns, the Parquet file may contain any number of additional columns.
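
The constraints on the `ml:extra_metadata` Parquet file can be sketched as a small check. This is a minimal sketch that assumes the table has already been loaded into a dictionary mapping column names to lists of values (for example with a Parquet reader of your choice); the `check_extra_metadata` function is hypothetical, and comparing the `id` column against the Item IDs as a set is one possible reading of the rule above.

```python
def check_extra_metadata(columns: dict, item_ids: list) -> list:
    """Return a list of problems; an empty list means the table looks valid.

    `columns` maps column name -> list of values, one entry per row.
    Hypothetical helper, not part of the ML-STAC package.
    """
    problems = []
    # The `id` and `split` columns are mandatory.
    for required in ("id", "split"):
        if required not in columns:
            problems.append(f"missing required column {required!r}")
    if problems:
        return problems

    # One row per Item in the Collection.
    n_rows = len(columns["id"])
    if n_rows != len(item_ids):
        problems.append(f"expected {len(item_ids)} rows, found {n_rows}")

    # Assumption: the `id` column must contain exactly the Item IDs.
    if set(columns["id"]) != set(item_ids):
        problems.append("the 'id' column does not match the Item IDs")
    return problems


table = {
    "id": ["item-0", "item-1"],
    "split": ["train", "test"],
    "cloud_cover": [0.1, 0.4],  # extra columns are allowed
}
print(check_extra_metadata(table, ["item-0", "item-1"]))
```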

Additional Field Information 📄

Contact Object 📇

A detailed description of the Contact Object can be found in the STAC Contact Extension.

Task Object 🎯

In the ML-STAC specification, the task field is a string chosen from an explicit list of names. Each task represents a different ML problem with specific input and target tensor properties. The current ML-STAC specification supports the following tasks:

TensorClassification: Convert an N-dimensional tensor into a single or multi-class categorical label.
TensorRegression: Convert an N-dimensional tensor into a single or multi-dimensional continuous label.
TensorObjectDetection: Convert an N-dimensional tensor into a list of bounding boxes and categorical class labels.
TensorSemanticSegmentation: Convert an N-dimensional tensor into a single or multi-class categorical label for each pixel.
TensorInstanceSegmentation: Convert an N-dimensional tensor into a list of bounding boxes and categorical class labels that are unique for each instance.
TensorPanopticSegmentation: Convert an N-dimensional tensor into a list of bounding boxes and categorical class labels that are unique for each instance and a single or multi-class categorical label for each pixel.
TensortoTensor: Convert an N-dimensional tensor into another N-dimensional tensor.
TensortoText: Convert an N-dimensional tensor into a string.
TexttoTensor: Convert a string into an N-dimensional tensor.
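
Because the task is restricted to the names listed above, it maps naturally onto an enumeration. The sketch below is one way to express that restriction; the `Task` class, its member names, and the `parse_task` helper are hypothetical and may differ from the actual ML-STAC data model.

```python
from enum import Enum


class Task(str, Enum):
    """Task names taken from the list above (hypothetical class)."""

    TENSOR_CLASSIFICATION = "TensorClassification"
    TENSOR_REGRESSION = "TensorRegression"
    TENSOR_OBJECT_DETECTION = "TensorObjectDetection"
    TENSOR_SEMANTIC_SEGMENTATION = "TensorSemanticSegmentation"
    TENSOR_INSTANCE_SEGMENTATION = "TensorInstanceSegmentation"
    TENSOR_PANOPTIC_SEGMENTATION = "TensorPanopticSegmentation"
    TENSOR_TO_TENSOR = "TensortoTensor"
    TENSOR_TO_TEXT = "TensortoText"
    TEXT_TO_TENSOR = "TexttoTensor"


def parse_task(name: str) -> Task:
    """Return the Task matching `name`, or raise a readable error."""
    try:
        return Task(name)
    except ValueError:
        allowed = ", ".join(task.value for task in Task)
        raise ValueError(f"Unknown task {name!r}; expected one of: {allowed}") from None


print(parse_task("TensorSemanticSegmentation").value)
```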

Split Object 🔄

The split_strategy field is a string that names the method used to split the dataset into training, validation and testing subsets. The current ML-STAC specification supports the following split strategies:

random: Randomly split the dataset into training, validation and testing subsets.
stratified: Split considering a specific dataset property. For instance, considering a temporal or spatial property, such as the season or the location.
systematic: Split considering an evenly spaced pattern.
other: Split considering a custom pattern.
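
To make the difference between the random and systematic strategies concrete, here is a minimal sketch. The `assign_splits` function and its `period` rule (within every `period` consecutive Items, one goes to validation and one to test) are assumptions made for illustration; the specification only names the strategies and does not prescribe how they are implemented.

```python
import random

SPLIT_STRATEGIES = ("random", "stratified", "systematic", "other")


def assign_splits(item_ids, strategy, period=5, seed=0):
    """Map each Item ID to 'train', 'validation' or 'test' (hypothetical helper)."""
    if strategy not in SPLIT_STRATEGIES:
        raise ValueError(f"Unknown split strategy {strategy!r}")
    if strategy not in ("random", "systematic"):
        # 'stratified' and 'other' depend on dataset-specific properties.
        raise NotImplementedError(f"{strategy!r} is not covered by this sketch")

    ids = list(item_ids)
    if strategy == "random":
        # Shuffle first, so the evenly spaced pattern below hits random Items.
        random.Random(seed).shuffle(ids)

    splits = {}
    for position, item_id in enumerate(ids):
        slot = position % period
        if slot == period - 1:
            splits[item_id] = "test"
        elif slot == period - 2:
            splits[item_id] = "validation"
        else:
            splits[item_id] = "train"
    return splits


print(assign_splits(range(10), "systematic"))
```

With `period=5`, both strategies send three Items out of every five to training; they differ only in which Items land in each subset.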

HyperLink Object 🔖

The HyperLink Object is a simple JSON object that contains a URL, which is validated using the rfc3986 library. It is composed of two fields: href and description. The href field is the URL of the link, and the OPTIONAL description field is a string that describes the link. The HyperLink Object is used in the ML-STAC specification to provide links to the raw data, to the issue tracker of the dataset, and to the extra metadata.
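
The structure of the HyperLink Object can be sketched with a small data class. Note that this sketch swaps in the standard-library `urllib.parse.urlparse` for a simplified URL check instead of the rfc3986 library mentioned above, so it is looser than the real validation; the `HyperLink` class shown here is hypothetical, and the URL is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class HyperLink:
    """A URL with an optional description (hypothetical class)."""

    href: str
    description: Optional[str] = None  # OPTIONAL field

    def __post_init__(self):
        # Simplified check: require a scheme and a host.
        # The real model validates against RFC 3986 instead.
        parts = urlparse(self.href)
        if not parts.scheme or not parts.netloc:
            raise ValueError(f"href is not an absolute URL: {self.href!r}")


discuss = HyperLink(
    href="https://github.com/example/dataset/issues",
    description="Issue tracker for the dataset",
)
print(discuss)
```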

About the AUTOMATIC fields 🔄

The ML-STAC Collection Pydantic data model automatically generates fields based on the REQUIRED fields.

Key	Type	Description
ml:dtype	Dtype Object	This field specifies the data type of the ItemTensors (refer to the ML Item spec).
ml:shape	Shape Object	This field specifies the shape of the ItemTensors.
ml:offsets	Offsets Object	This field indicates the span of binaries corresponding to the ItemTensors.
ml:documentation	string	Based on all REQUIRED ML-STAC fields the ML-STAC Collection Pydantic data model generates a documentation string that contains the information of the ML-STAC Collection.
file:size	float	The file size, specified in bytes.
file:checksum	string	Provides a way to specify file checksums (e.g. BLAKE2, MD5, SHA1, SHA2, SHA3). The hashes are self-identifying hashes as described in the Multihash specification and must be encoded as hexadecimal (base 16) string with lowercase letters.
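
As an illustration of the `file:size` and `file:checksum` fields, the sketch below computes both for an in-memory payload. It assumes SHA2-256 as the hash function, for which the Multihash prefix is the function code 0x12 followed by the digest length 0x20 (32 bytes); the `sha256_multihash_hex` helper is hypothetical, and the ML-STAC package may use a different hash function.

```python
import hashlib


def sha256_multihash_hex(data: bytes) -> str:
    """Return the SHA2-256 multihash of `data` as a lowercase hex string."""
    digest = hashlib.sha256(data).digest()
    # Multihash layout: <hash function code><digest length><digest>.
    # For sha2-256 the code is 0x12 and the digest length is 32 (0x20).
    return (bytes([0x12, len(digest)]) + digest).hex()


payload = b"example tensor bytes"  # placeholder content
metadata = {
    "file:size": len(payload),  # size in bytes
    "file:checksum": sha256_multihash_hex(payload),
}
print(metadata)
```

The leading `1220` in the resulting string is what makes the hash self-identifying: a reader can tell the hash function and digest length without any extra metadata.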

Additional Field Information 📄

Dtype Object 🎲

This field specifies the data type of the input, target, and extra ItemTensors. The field is a dictionary composed of three keys: input, target, and extra. Each key is associated with a string object that represents the data type of the ItemTensor.

For instance, a valid dtype object is:

{
    "input": "float32",
    "target": "int32",
    "extra": "str"
}

The dtypes that the ML-STAC specification supports are:

float16: 16-bit floating point number.
float32: 32-bit floating point number.
float64: 64-bit floating point number.
int8: 8-bit signed integer.
int16: 16-bit signed integer.
int32: 32-bit signed integer.
uint8: 8-bit unsigned integer.
uint16: 16-bit unsigned integer.
bool: Boolean.
str: String.

The data type must be consistent across all the ItemTensor objects in the same Catalog. If the input ItemTensor is None, dtype['input'] is automatically set to 'str'. Similarly, if the target ItemTensor is None and the target ItemMetadata is a string, dtype['target'] is automatically set to 'str'. However, if the target ItemTensor and ItemMetadata are both None, dtype['target'] is set to None as well. Finally, if the extra ItemTensor is None, dtype['extra'] is automatically set to None.
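
The fallback rules above can be summarized in a short function. This is a sketch under stated assumptions: the `resolve_dtypes` name and its signature are hypothetical, tensors are represented only by their dtype name (or None when the ItemTensor is None), and the case of a missing target tensor with non-string metadata is not covered by the text, so the sketch raises an error for it.

```python
# Data types listed as supported in this section.
SUPPORTED_DTYPES = {
    "float16", "float32", "float64",
    "int8", "int16", "int32",
    "uint8", "uint16",
    "bool", "str",
}


def resolve_dtypes(input_dtype, target_dtype, extra_dtype, target_metadata=None):
    """Apply the fallback rules described above (hypothetical helper)."""
    for name in (input_dtype, target_dtype, extra_dtype):
        if name is not None and name not in SUPPORTED_DTYPES:
            raise ValueError(f"Unsupported dtype: {name!r}")

    if target_dtype is not None:
        target = target_dtype
    elif isinstance(target_metadata, str):
        target = "str"  # no target tensor, but string metadata
    elif target_metadata is None:
        target = None  # neither tensor nor metadata
    else:
        # Assumption: this case is not described in the text above.
        raise ValueError("target metadata must be a string or None")

    return {
        "input": input_dtype if input_dtype is not None else "str",
        "target": target,
        "extra": extra_dtype,  # stays None when the extra tensor is None
    }


print(resolve_dtypes("float32", None, None, target_metadata="forest"))
```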

Shape Object 📏

This field specifies the shape of the input, target, and extra ItemTensors. The field is a dictionary composed of three keys: input, target, and extra. Each key is associated with a list of integers that represents the shape of the corresponding ItemTensor. For instance, a valid shape object is:

{
    "input": [3, 256, 256],
    "target": [256, 256],
    "extra": null
}

A key is set to null (None in Python) when the corresponding ItemTensor is None.

Offsets Object 🎛️

This field indicates the byte range occupied by the input, target, and extra ItemTensors. The field is a dictionary composed of three keys: input, target, and extra. Each key is associated with a list of two integers that represents the byte range of the corresponding ItemTensor. For instance, a valid offsets object is:

{
    "input": [0, 1000000],
    "target": [1000000, 2000000],
    "extra": null
}

The first value of the list is the beginning of the binary range (BEGIN) and the second value is the end of the binary range (END). The binary range is inclusive. For instance, the binary range [0, 1000000] contains 1000001 bytes. The offsets object is used to read the ItemTensors in a lazy manner.
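
Lazy reading with an offsets object can be sketched as a seek followed by a bounded read. The `read_span` function is hypothetical; it follows the inclusive convention stated above, so a range [BEGIN, END] yields END - BEGIN + 1 bytes, and it works on any seekable binary stream (a local file opened in "rb" mode, or an in-memory buffer as in the example).

```python
import io


def read_span(stream, offsets):
    """Read the bytes of one ItemTensor from a seekable binary stream.

    `offsets` is [BEGIN, END] with END inclusive, or None when the
    ItemTensor is absent. Hypothetical helper for illustration.
    """
    if offsets is None:
        return None
    begin, end = offsets
    if begin < 0 or end < begin:
        raise ValueError(f"Invalid byte range: {offsets!r}")
    stream.seek(begin)
    # Inclusive END: the span holds end - begin + 1 bytes.
    return stream.read(end - begin + 1)


# Usage with an in-memory buffer standing in for a binary file.
buffer = io.BytesIO(bytes(range(10)))
chunk = read_span(buffer, [2, 4])
print(list(chunk))
```

Only the requested span is read, so the other ItemTensors stored in the same file are never loaded into memory.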