Safetensor

What is Safetensor? 🤖

Safetensor is a file format designed for storing machine learning tensors safely and loading them quickly. It integrates well with systems like ML-STAC, where it covers essential storage and data access needs.

Why Safetensor Shines Among Tensor Formats 🤔

Safetensor was created as an answer to the drawbacks of pickle in PyTorch, chiefly that loading a pickle file can execute arbitrary code. It also combines features that many other tensor formats lack.

Below is a comparison table that highlights the features of Safetensor against other widely-used tensor formats:

Format	Safe	Zero-copy	Lazy loading	No file size limit	Layout control	Flexibility	Bfloat16
pickle (PyTorch)	✗	✗	✗	🗸	✗	🗸	🗸
H5 (Tensorflow)	🗸	✗	🗸	🗸	~	~	✗
SavedModel (Tensorflow)	🗸	✗	✗	🗸	🗸	✗	🗸
MsgPack (flax)	🗸	🗸	✗	🗸	✗	✗	🗸
Protobuf (ONNX)	🗸	✗	✗	✗	✗	✗	🗸
Cap'n'Proto	🗸	🗸	~	🗸	🗸	~	✗
Arrow	?	?	?	?	?	?	✗
Numpy (npy,npz)	🗸	?	?	✗	🗸	✗	✗
pdparams (Paddle)	✗	✗	✗	🗸	✗	🗸	🗸
SafeTensors	🗸	🗸	🗸	🗸	🗸	✗	🗸

The table summarizes how Safetensor compares with other formats in safety, loading behavior, and supported data attributes. Source: HuggingFace Safetensors Repository.

Why use Safetensor in ML-STAC? 🤔

Safetensors positions itself as the ideal format for storing Samples in ML-STAC for several reasons:

🔒 Security: Unlike formats like pickle, safetensors is designed to prevent the execution of unauthorized code, offering increased trust when working with data files.
🚀 Load Optimization: Its ability to load data without the need for additional copying (zero-copy) is paramount, especially in environments where memory is a precious resource.
💡 Lazy Loading: This feature allows users to review or access specific tensors in the file without loading it entirely.
📊 Layout Control: Safetensors records where each tensor sits in the file, which enables quick access to individual tensors.
📏 No Size Limit: It offers more flexibility by not setting a maximum file size, unlike protobuf.
🧠 Bfloat16 Support: It natively supports bfloat16, crucial in many machine learning scenarios.
🛡️ DoS Attack Prevention: The format is designed to guard against Denial of Service attacks from maliciously crafted files.
⚡ Load Speed: It stands out for its fast loading times, both on CPU and GPU.
📜 Apache-2.0 License: Encourages adoption within the open-source community.
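
The security point above can be illustrated with the standard library alone. The following is a minimal sketch of why loading an untrusted pickle is risky: the `Payload` class is hypothetical, and the harmless expression `1 + 1` stands in for whatever code an attacker might choose. A Safetensor file, by contrast, contains only a JSON header and raw tensor bytes, so loading it does not involve calling functions named in the file.

```python
import pickle


class Payload:
    """Hypothetical object that controls what happens when it is unpickled."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('1 + 1')".
        # An attacker could name any importable callable here.
        return (eval, ("1 + 1",))


blob = pickle.dumps(Payload())

# Loading the bytes runs the expression chosen by whoever created the file.
loaded = pickle.loads(blob)
print(loaded)  # 2: the result of the evaluated expression, not a Payload object
```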

Internal Format of safetensor 📘

A Safetensor file (for example, model.safetensors) has three main components, described below. Source: HuggingFace Safetensors Repository.

1. The Initial 8 bytes: 🚀

The file begins with 8 bytes that hold an unsigned 64-bit integer (u64, little-endian), designated N, which gives the size of the file's header in bytes. In other words, N tells us how many bytes to read next to extract all the header information.
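
Reading N can be sketched with Python's `struct` module. This is a minimal sketch, assuming the little-endian byte order given in the safetensors format description; the function name `read_header_size` and the in-memory stand-in file are illustrative, not part of any library.

```python
import io
import struct


def read_header_size(f):
    """Return N, the header length in bytes, from a binary file object."""
    raw = f.read(8)
    if len(raw) != 8:
        raise ValueError("file too short to contain the 8-byte size prefix")
    # "<Q" means: little-endian, unsigned 64-bit integer.
    (n,) = struct.unpack("<Q", raw)
    return n


# Hypothetical in-memory file: a size prefix of 42 followed by 42 placeholder bytes.
fake_file = io.BytesIO(struct.pack("<Q", 42) + b" " * 42)
header_size = read_header_size(fake_file)
print(header_size)  # 42
```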

2. The Header (N bytes): 📜

Using the value from the initial 8 bytes, we read the next section of the file, which takes up precisely N bytes. This section is a UTF-8 encoded JSON string: the file's header. The header acts as an "index" or "table of contents" for the file, describing each stored tensor.

Header Details 🔍:

Tensor Names (TENSOR_NAME_1, TENSOR_NAME_2, ...): Unique identifiers for each tensor stored in the file.
dtype: The tensor's data type, e.g., "F16" for a 16-bit floating-point type. Other possible data types include "F64", "F32", etc.
shape: 📐 The dimensions or shape of the tensor.
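offsets: 📍 Indicates where each tensor's bytes start and end. In the header this is the data_offsets key, a [begin, end] pair measured from the start of the data section that follows the header, not from the start of the file. This provides a direct route to read specific data from each tensor.
metadata: 📌 Additional information that might be related to the file or the stored tensors, stored under the special __metadata__ key as string-to-string pairs.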
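
To make the header fields concrete, here is a sketch that parses a small hand-written header and checks that each tensor's offsets match the size implied by its dtype and shape. The header contents, the `describe_tensors` helper, and the partial `DTYPE_SIZES` table are illustrative assumptions; the key names `data_offsets` and `__metadata__` follow the published safetensors format.

```python
import json

# Hypothetical header for a file with two tensors; names and values are illustrative.
header_json = """
{
  "__metadata__": {"framework": "example"},
  "TENSOR_NAME_1": {"dtype": "F16", "shape": [2, 3], "data_offsets": [0, 12]},
  "TENSOR_NAME_2": {"dtype": "F32", "shape": [4], "data_offsets": [12, 28]}
}
"""

# Bytes per element for a few dtype codes (not the full list).
DTYPE_SIZES = {"F64": 8, "F32": 4, "F16": 2, "BF16": 2}


def describe_tensors(header):
    """Return {name: (dtype, shape, nbytes)} for every tensor entry in the header."""
    summary = {}
    for name, entry in header.items():
        if name == "__metadata__":  # free-form string map, not a tensor
            continue
        begin, end = entry["data_offsets"]
        expected = DTYPE_SIZES[entry["dtype"]]
        for dim in entry["shape"]:
            expected *= dim
        # The span covered by the offsets must equal the size implied by dtype and shape.
        if end - begin != expected:
            raise ValueError(f"{name}: offsets cover {end - begin} bytes, expected {expected}")
        summary[name] = (entry["dtype"], tuple(entry["shape"]), expected)
    return summary


summary = describe_tensors(json.loads(header_json))
for name, info in summary.items():
    print(name, info)
```

Because every tensor's position and size can be derived from the header alone, a reader can decide which tensors it needs before touching any tensor data.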

3. Rest of the file 💾:

After the header comes the rest of the file: a single byte buffer holding the raw tensor data. The header serves as an index or table of contents, while this buffer holds the actual tensor values, which are located through the offsets recorded in the header.
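
Putting the three parts together, the sketch below writes a tiny file by hand and then reads back a single tensor by seeking straight to its bytes. It is a simplified illustration under stated assumptions, not a replacement for the safetensors library, which performs additional validation; `write_minimal_file` and `read_tensor_bytes` are hypothetical helper names, and the tensor names and values are made up.

```python
import json
import os
import struct
import tempfile


def write_minimal_file(path, tensors):
    """Write {name: (dtype, shape, raw_bytes)} using the three-part layout."""
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 1. size prefix N
        f.write(header_bytes)                          # 2. JSON header
        f.write(payload)                               # 3. tensor data


def read_tensor_bytes(path, name):
    """Read only the bytes of one tensor, skipping everything else."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(n).decode("utf-8"))
        begin, end = header[name]["data_offsets"]
        # Offsets are relative to the data section, which starts at byte 8 + N.
        f.seek(8 + n + begin)
        return f.read(end - begin)


# Two small F32 tensors packed as little-endian floats (illustrative values).
a = struct.pack("<3f", 1.0, 2.0, 3.0)
b = struct.pack("<2f", 4.0, 5.0)
path = os.path.join(tempfile.mkdtemp(), "model.safetensors")
write_minimal_file(path, {"a": ("F32", [3], a), "b": ("F32", [2], b)})

values_b = struct.unpack("<2f", read_tensor_bytes(path, "b"))
print(values_b)  # (4.0, 5.0)
```

Reading tensor "b" never touches the bytes of tensor "a", which is the idea behind the lazy loading feature described earlier.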