Machine Learning SpatioTemporal Asset Catalogs (ML-STAC) 🤖🌐
The ML-STAC
specification is composed of different Pydantic Models:
-
Tensor: Represents only ONE tensor type, chosen from
np.ndarray
,torch.Tensor
,jax.numpy.ndarray
,paddle.Tensor
, andtensorflow.Tensor
. -
SampleTensor: This is linked to the
Tensor
and includes three attributes:input
,target
, andextra
, all of them of the Tensor class. -
SampleMetadata: Contains metadata for samples with attributes like
input
,target
,extra
,geotransform
,crs
,id
, and date-time attributes (start_datetime
,end_datetime
). -
Sample: Core unit to this model, an
Sample
has atensor
of typeSampleTensor
andmetadata
of typeSampleMetadata
. -
Catalog: Comprises fields for the number of samples (
n_samples
) and a hyperlink (url
). Each Collection must define three Catalogs:train
,validation
, andtest
. A Catalog can link to multiple samples, while each sample is associated with only one of the Catalogs (train
,validation
, ortest
). -
Collection: A broader category that houses multiple catalogs (train, validation, and test). Attributes of a collection include its
name
, ML-STAC version, authorship information (authors
),licenses
,split_strategy
, etc. There are additional properties like the computer vision task (cv_task
),sensor
details, band information (bands
), and data type (dtype
) for the Samples. -
License and Licenses: Holds information about the licensing of the data, including the license name and a link. Multiple licenses can be grouped together, and additional comments can be attached.
-
Reviewers and Reviewer: These elements capture details about individuals or entities reviewing the dataset. Each reviewer has a name, a reference, a score, and an issue link. There can be multiple reviewers for a dataset.
-
Authors and Author: Detailed attributes for authors include a list of authors, details about who curated the data, and additional comments. Each author has a name, reference, and organizational affiliation.
-
Split: Indicates data splitting strategies such as random, stratified, cluster, systematic, or other.
-
Task, Sensor, Bands, and Dtype: These entities capture specifics about the machine learning task, the sensor used to gather the data, the bands of data, and the data type respectively.
-
Extent, SpatialExtent, and TemporalExtent: These components define the spatial and temporal coverage of the dataset. The spatial extent is defined by bounding boxes, while the temporal extent captures time intervals.
-
Hyperlink: This is a basic component that holds a URL link, utilized in various parts of the model.
The UML diagram below represents a cohesive structure, detailing how datasets in the ML-STAC
specification are organized, catalogued, and linked together.