ML STAC Collection Specification 🤖
Overview 📜
This document explains the structure and content of the ML Collection extension. An ML-STAC Collection is simply a STAC Collection with additional fields (i.e. extensions). The STAC Collection Specification defines a set of common fields to describe a group of Items that share properties and metadata. These fields encompass vital information about the tensor structure, the tasks, the providers, the data quality, etc. Most of these fields are estimated automatically from the ML-STAC Items, but others must be configured manually by the data provider. It is important to note that the ML-STAC Catalog and ML-STAC Item do not comply with the STAC specification, in favour of a structure that is more intuitive and efficient in ML workflows. To create an ML-STAC Collection, you only need to comply with the REQUIRED fields of both the STAC Collection and the ML-STAC Collection, which are described in the sections below.
About the STAC fields 🛠️
These fields are taken from the STAC Collection specification v1.0.0-rc.4. For further details on how to create a STAC Collection, please refer to the STAC Collection specification.
Element | Type | Description |
---|---|---|
type | string | REQUIRED. Must be set to Collection to be a valid Collection. |
stac_version | string | REQUIRED. The STAC version the Collection implements. |
id | string | REQUIRED. A unique identifier string. |
description | string | REQUIRED. Detailed multi-line description to fully explain the Collection. CommonMark 0.29 syntax MAY be used for rich text representation. |
license | string | REQUIRED. Collection's license(s), either an SPDX License identifier, various if multiple licenses apply or proprietary for all other cases. |
extent | Extent Object | REQUIRED. Spatial and temporal extents. |
links | [Link Object] | REQUIRED. A list of references to other documents. |
keywords | [string] | OPTIONAL. List of keywords describing the Collection. |
providers | [Provider Object] | OPTIONAL. A list of providers, which may include all organizations capturing or processing the data or the hosting provider. Providers should be listed in chronological order with the most recent provider being the last element of the list. |
Additional Field Information 📄
stac_version 📜
In general, STAC versions can be mixed, but please keep the recommended best practices in mind.
stac_extensions 🧩
The ML-STAC Collection depends on the following extensions:
- File extension: Provides a way to specify file-related details such as checksum and size for the dataset.
- Datacube extension: Provides a way to specify the tensor dimensions.
- Classification extension: Provides a way to specify the classification scheme of the dataset. Useful for classification, object detection, semantic segmentation, instance segmentation, and panoptic segmentation tasks.
- Projection extension: Provides a way to specify the projection of the dataset.
Extent Object 🌍
More information about this field can be found in the STAC Collection specification Extent Object section.
Link Object 🔗
An ML-STAC Collection does not need to define a relationship with another entity. Therefore, the rel field must be set to self. A detailed description of the Link Object can be found in the STAC Collection specification.
Provider Object 🏢
More information about this field can be found in the STAC Collection specification Provider Object section.
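As a rough illustration of the REQUIRED STAC fields described in this section, the sketch below builds a minimal Collection as a plain Python dictionary and checks that no REQUIRED key is missing. The identifier, description, license, extent values and href are made-up placeholders, and `missing_required_fields` is a hypothetical helper, not part of any ML-STAC tooling.

```python
# REQUIRED STAC Collection fields, as listed in the table of this section.
REQUIRED_STAC_FIELDS = (
    "type", "stac_version", "id", "description", "license", "extent", "links",
)


def missing_required_fields(collection: dict) -> list:
    """Return the REQUIRED STAC fields that are absent from `collection`."""
    return [key for key in REQUIRED_STAC_FIELDS if key not in collection]


# Every value below is a placeholder chosen for illustration only.
collection = {
    "type": "Collection",
    "stac_version": "1.0.0-rc.4",
    "id": "example-dataset",
    "description": "A placeholder ML-STAC Collection.",
    "license": "proprietary",
    "extent": {
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        "temporal": {"interval": [["2020-01-01T00:00:00Z", None]]},
    },
    # This specification asks for the rel field to be set to "self".
    "links": [{"rel": "self", "href": "https://example.com/collection.json"}],
}

missing = missing_required_fields(collection)
```

An empty `missing` list only means the keys are present; it does not validate their contents against the STAC schema.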
About the ML-STAC fields 🤖
ML-STAC Collections add extra fields to the traditional STAC Collection. All ML-STAC fields carry the ml: prefix. The following is a list of required and optional fields:
Key | Type | Description |
---|---|---|
ml:mlstac_version | string | REQUIRED. The ML-STAC version the Collection implements. |
ml:train | MLSTAC-Catalog Object | REQUIRED. Represents the training dataset as an ML-STAC Catalog. |
ml:validation | MLSTAC-Catalog Object | REQUIRED. Represents the validation dataset as an ML-STAC Catalog. |
ml:test | MLSTAC-Catalog Object | REQUIRED. Represents the testing dataset as an ML-STAC Catalog. |
ml:curator | Contact Object | REQUIRED. A list of contacts qualified by their role. |
ml:task | Task Object | REQUIRED. The task is a string chosen from an explicit list of names. |
ml:split_strategy | Split Object | OPTIONAL. The split strategy is a string chosen from an explicit list of method names. |
ml:discuss | HyperLink Object | STRONGLY RECOMMENDED. Each dataset should have a link to report issues with respect to the curation of the dataset or to discuss the dataset in general. This link usually points to a GitHub Repository Issue Tracker. |
ml:raw | HyperLink Object | STRONGLY RECOMMENDED. Each dataset should have a link to the original raw data. |
ml:extra_metadata | HyperLink Object | OPTIONAL. Each ML dataset can have a link to additional metadata. The extra_metadata must be a link to an Apache Parquet file. The number of rows in the Parquet file must be equal to the number of Items in the ML-STAC Collection. The Parquet file must contain a column named id that holds the Item IDs and a column named split that holds the split name of each Item. Beyond these two columns, the Parquet file can contain any number of additional columns. |
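
The constraints on ml:extra_metadata can be checked once the Parquet file has been loaded. The sketch below assumes the table has already been read into a list of row dictionaries by a Parquet reader of your choice. `check_extra_metadata` and the sample rows are hypothetical and only mirror the rules stated in the table; the final check (IDs must match the Item IDs) is an assumption that goes slightly beyond the text.

```python
def check_extra_metadata(rows: list, item_ids: list) -> list:
    """Return a list of problems found (empty when the table looks valid)."""
    problems = []
    # Rule 1: one row per Item in the ML-STAC Collection.
    if len(rows) != len(item_ids):
        problems.append(f"expected {len(item_ids)} rows, found {len(rows)}")
    # Rule 2: the `id` and `split` columns must be present.
    for column in ("id", "split"):
        if any(column not in row for row in rows):
            problems.append(f"missing column: {column}")
    # Assumption: the values of `id` should be exactly the Item IDs.
    if not problems and {row["id"] for row in rows} != set(item_ids):
        problems.append("id column does not match the Item IDs")
    return problems


# Hypothetical rows; `cloud_cover` stands in for any additional column.
rows = [
    {"id": "item-0", "split": "train", "cloud_cover": 0.1},
    {"id": "item-1", "split": "test", "cloud_cover": 0.4},
]
ok = check_extra_metadata(rows, ["item-0", "item-1"])
bad = check_extra_metadata(rows[:1], ["item-0", "item-1"])
```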
Additional Field Information 📄
Contact Object 📇
A detailed description of the Contact Object can be found in the STAC Contact Extension.
Task Object 🎯
In the ML-STAC specification, the task field is a string chosen from an explicit list of names. Each task represents a different ML problem with specific input and target tensor properties. The current ML-STAC specification supports the following tasks:
- TensorClassification: Convert an N-dimensional tensor into a single or multi-class categorical label.
- TensorRegression: Convert an N-dimensional tensor into a single or multi-dimensional continuous label.
- TensorObjectDetection: Convert an N-dimensional tensor into a list of bounding boxes and categorical class labels.
- TensorSemanticSegmentation: Convert an N-dimensional tensor into a single or multi-class categorical label for each pixel.
- TensorInstanceSegmentation: Convert an N-dimensional tensor into a list of bounding boxes and categorical class labels that are unique for each instance.
- TensorPanopticSegmentation: Convert an N-dimensional tensor into a list of bounding boxes and categorical class labels that are unique for each instance and a single or multi-class categorical label for each pixel.
- TensortoTensor: Convert an N-dimensional tensor into another N-dimensional tensor.
- TensortoText: Convert an N-dimensional tensor into a string.
- TexttoTensor: Convert a string into an N-dimensional tensor.
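One way to enforce "a string chosen from an explicit list" is an enumeration. The sketch below copies the task names from the list above; the `Task` class and `parse_task` helper are hypothetical names used for illustration and are not claimed to match the actual ML-STAC data model.

```python
from enum import Enum


class Task(str, Enum):
    """Task names copied verbatim from the list above."""
    TENSOR_CLASSIFICATION = "TensorClassification"
    TENSOR_REGRESSION = "TensorRegression"
    TENSOR_OBJECT_DETECTION = "TensorObjectDetection"
    TENSOR_SEMANTIC_SEGMENTATION = "TensorSemanticSegmentation"
    TENSOR_INSTANCE_SEGMENTATION = "TensorInstanceSegmentation"
    TENSOR_PANOPTIC_SEGMENTATION = "TensorPanopticSegmentation"
    TENSOR_TO_TENSOR = "TensortoTensor"
    TENSOR_TO_TEXT = "TensortoText"
    TEXT_TO_TENSOR = "TexttoTensor"


def parse_task(name: str) -> Task:
    """Return the matching Task, or raise ValueError for unknown names."""
    try:
        return Task(name)
    except ValueError:
        raise ValueError(f"unsupported ml:task: {name!r}") from None


task = parse_task("TensorSemanticSegmentation")
```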
Split Object 🔄
The split_strategy field is a string that identifies how the dataset was split into training, validation and testing subsets. The current ML-STAC specification supports the following split strategies:
- random: Randomly split the dataset into training, validation and testing subsets.
- stratified: Split considering a specific dataset property. For instance, considering a temporal or spatial property, such as the season or the location.
- systematic: Split following an evenly spaced pattern.
- other: Split following a custom pattern.
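To make the difference between the random and systematic strategies concrete, here is a minimal sketch of each. The function names, the 80/10/10 fractions and the period of 10 are illustrative assumptions; the specification only names the strategies and does not prescribe how they are implemented.

```python
import random


def random_split(ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the ids, then cut them into train/validation/test."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def systematic_split(ids, period=10):
    """Walk the ids in order; in every block of `period` items the
    second-to-last goes to validation and the last goes to test."""
    split = {"train": [], "validation": [], "test": []}
    for position, item_id in enumerate(ids):
        slot = position % period
        if slot == period - 1:
            split["test"].append(item_id)
        elif slot == period - 2:
            split["validation"].append(item_id)
        else:
            split["train"].append(item_id)
    return split


ids = [f"item-{i}" for i in range(20)]
by_chance = random_split(ids)
by_position = systematic_split(ids)
```

With 20 items and the default period, the systematic variant always sends the items at positions 9 and 19 to the test subset, whereas the random variant depends on the seed.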
HyperLink Object 🔖
The HyperLink Object is a simple JSON object that contains a URL. The URL is validated using the rfc3986 library. The HyperLink Object is composed of two fields: href and description. The href field is the URL of the link and the description field is a string that describes the link. The description field is OPTIONAL. The HyperLink Object is used in the ML-STAC specification to provide links to the raw data and to the issue tracker of the dataset.
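The shape of a HyperLink Object can be sketched with a small dataclass. Note that this example swaps in the standard library's `urllib.parse.urlsplit` as a stand-in for the rfc3986 validation mentioned above, so the check is looser than the real one; the `HyperLink` class name and the example URL are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class HyperLink:
    href: str                          # URL of the link
    description: Optional[str] = None  # OPTIONAL free-text description

    def __post_init__(self):
        # Stand-in check (not rfc3986): require a scheme and a host.
        parts = urlsplit(self.href)
        if not parts.scheme or not parts.netloc:
            raise ValueError(f"href is not an absolute URL: {self.href!r}")


# Placeholder URL for a dataset issue tracker.
link = HyperLink(
    href="https://github.com/example/dataset/issues",
    description="Issue tracker",
)
```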
About the AUTOMATIC fields 🔄
The ML-STAC Collection Pydantic data model automatically generates fields based on the REQUIRED fields.
Key | Type | Description |
---|---|---|
ml:dtype | Dtype Object | This field specifies the data type of the ItemTensors. (refer to the ML Item spec) |
ml:shape | Shape Object | This field specifies the shape of the ItemTensors. |
ml:offsets | Offsets Object | This field indicates the byte range that corresponds to each of the ItemTensors. |
ml:documentation | string | Based on all REQUIRED ML-STAC fields the ML-STAC Collection Pydantic data model generates a documentation string that contains the information of the ML-STAC Collection. |
file:size | float | The file size, specified in bytes. |
file:checksum | string | Provides a way to specify file checksums (e.g. BLAKE2, MD5, SHA1, SHA2, SHA3). The hashes are self-identifying hashes as described in the Multihash specification and must be encoded as hexadecimal (base 16) string with lowercase letters. |
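
As an illustration of the file:size and file:checksum values, the sketch below computes a Multihash-style checksum for SHA2-256 and the size in bytes. In the Multihash format the digest is prefixed by the hash function code (0x12 for sha2-256) and the digest length (0x20, i.e. 32 bytes). The helper name and the payload are hypothetical, and other hash functions use different prefixes.

```python
import hashlib


def sha256_multihash_hex(data: bytes) -> str:
    """Return the sha2-256 Multihash of `data` as lowercase hexadecimal."""
    digest = hashlib.sha256(data).digest()
    # Multihash layout: <hash function code><digest length><digest>.
    # 0x12 identifies sha2-256; len(digest) is 32 (0x20).
    return (bytes([0x12, len(digest)]) + digest).hex()


payload = b"example tensor bytes"  # placeholder file content
checksum = sha256_multihash_hex(payload)
size = len(payload)                # file size in bytes
```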
Additional Field Information 📄
Dtype Object 🎲
This field specifies the data type of the input, target, and extra ItemTensors. The field is a dictionary composed of three keys: input, target, and extra. Each key is associated with a string that represents the data type of the ItemTensor.
For instance, a valid dtype object is:
{
"input": "float32",
"target": "int64",
"extra": "str"
}
The ML-STAC specification supports the following dtypes:
- float16: 16-bit floating point number.
- float32: 32-bit floating point number.
- float64: 64-bit floating point number.
- int8: 8-bit signed integer.
- int16: 16-bit signed integer.
- int32: 32-bit signed integer.
- uint8: 8-bit unsigned integer.
- uint16: 16-bit unsigned integer.
- bool: Boolean.
- str: String.
The data type must be consistent across all ItemTensor objects in the same Catalog. If the input ItemTensor is None, dtype['input'] is automatically set to 'str'. Similarly, if the target ItemTensor is None and the target ItemMetadata is a string, dtype['target'] is automatically set to 'str'. However, if the target ItemTensor and ItemMetadata are both None, dtype['target'] is set to None as well. Finally, if the extra ItemTensor is None, dtype['extra'] is automatically set to None.
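The fallback rules in the paragraph above can be written down as a small function. This is only a sketch of the stated rules: `resolve_dtypes` and its parameters are hypothetical names, and the case where the target ItemMetadata is neither None nor a string is not covered by the text, so it is treated as None here by assumption.

```python
def resolve_dtypes(input_dtype, target_dtype, extra_dtype, target_metadata=None):
    """Each *_dtype is the dtype string of an ItemTensor, or None if absent."""
    resolved = {}
    # A missing input tensor falls back to 'str'.
    resolved["input"] = input_dtype if input_dtype is not None else "str"
    if target_dtype is not None:
        resolved["target"] = target_dtype
    elif isinstance(target_metadata, str):
        # No target tensor, but the target metadata is a string.
        resolved["target"] = "str"
    else:
        # No target tensor and no string metadata (assumption for non-string
        # metadata: also None).
        resolved["target"] = None
    # A missing extra tensor stays None.
    resolved["extra"] = extra_dtype
    return resolved


full = resolve_dtypes("float32", "int32", None)
text_target = resolve_dtypes("float32", None, None, target_metadata="forest")
no_target = resolve_dtypes(None, None, None)
```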
Shape Object 📏
This field specifies the shape of the input, target, and extra ItemTensors. The field is a dictionary composed of three keys: input, target, and extra. Each key is associated with a list of integers that represents the shape of the ItemTensor. For instance, a valid shape object is:
{
"input": [3, 256, 256],
"target": [256, 256],
"extra": null
}
A key is set to None (null in JSON) when the corresponding ItemTensor is None.
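Shape and dtype together give a quick estimate of how many bytes a tensor occupies. The sketch below assumes tensors are stored uncompressed and densely packed, and that a bool takes one byte; neither assumption is stated by this specification, and `tensor_nbytes` is a hypothetical helper. Strings have no fixed width, so they yield None.

```python
from math import prod

# Bytes per element for the fixed-width dtypes listed in the Dtype Object
# section (bool = 1 byte is an assumption).
ITEMSIZE = {
    "float16": 2, "float32": 4, "float64": 8,
    "int8": 1, "int16": 2, "int32": 4,
    "uint8": 1, "uint16": 2, "bool": 1,
}


def tensor_nbytes(shape, dtype):
    """Return the uncompressed size in bytes, or None when it is unknown."""
    if shape is None or dtype not in ITEMSIZE:
        return None
    return prod(shape) * ITEMSIZE[dtype]


input_bytes = tensor_nbytes([3, 256, 256], "float32")  # 3 * 256 * 256 * 4
extra_bytes = tensor_nbytes(None, None)                 # absent tensor
```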
Offsets Object 🎛️
This field indicates the byte range that corresponds to each of the input, target, and extra ItemTensors. The field is a dictionary composed of three keys: input, target, and extra. Each key is associated with a list of two integers that represents the byte range of the ItemTensor. For instance, a valid offsets object is:
{
"input": [0, 1000000],
"target": [1000000, 2000000],
"extra": null
}
The first value of the list is the beginning of the byte range (BEGIN) and the second value is the end of the byte range (END). The byte range is inclusive. For instance, the byte range [0, 1000000] contains 1000001 bytes. The offsets object is used to read the ItemTensors in a lazy manner.
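A lazy read with inclusive offsets can be sketched as follows. `read_range` and `http_range_header` are hypothetical helpers, and the in-memory stream stands in for a real file. The second helper relies on the fact that HTTP Range requests also use inclusive byte positions, so BEGIN and END can be passed through unchanged when the data is served remotely.

```python
import io


def read_range(stream, begin, end):
    """Read the inclusive byte range [begin, end] from a seekable stream."""
    stream.seek(begin)
    # Inclusive range: END itself is part of the data, hence the +1.
    return stream.read(end - begin + 1)


def http_range_header(begin, end):
    """Build an HTTP Range header for the same inclusive byte range."""
    return {"Range": f"bytes={begin}-{end}"}


# Ten bytes with values 0..9 stand in for a binary file of ItemTensors.
blob = io.BytesIO(bytes(range(10)))
chunk = read_range(blob, 2, 5)
header = http_range_header(0, 1000000)
```

Only the requested bytes are read, so the other tensors in the same file are never loaded into memory.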