text-generation-inference documentation
Safetensors
Safetensors is a serialization format for deep learning model weights. It loads faster and is safer than other serialization formats such as pickle, which many deep learning libraries use under the hood and which can execute arbitrary code when a file is loaded.
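The safety point can be illustrated with a small, harmless sketch using only the Python standard library. The `Payload` class is a hypothetical name introduced for this example. It shows that unpickling can call an arbitrary function chosen by whoever produced the file, whereas a safetensors file holds only a JSON header and raw tensor bytes.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle stream asks the loader to call a function."""

    def __reduce__(self):
        # On load, pickle calls eval("6 * 7") instead of rebuilding a Payload.
        # A malicious file could name any importable callable here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes runs the embedded call; no Payload instance comes back.
restored = pickle.loads(blob)
print(type(restored).__name__, restored)
```

The expression here is benign, but the same mechanism lets an untrusted checkpoint run code on the machine that loads it.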
TGI depends on the safetensors format mainly to enable tensor parallelism sharding. When serving a model repository, TGI looks for safetensors weights. If none are found, TGI converts the PyTorch weights to the safetensors format.
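To see why this format suits sharding, here is a minimal sketch that uses only the Python standard library. It is not TGI's loader. It assumes the layout described in the safetensors documentation (an 8-byte little-endian header size, a JSON header with `dtype`, `shape`, and `data_offsets` per tensor, then the raw bytes). The helper names `write_safetensors` and `read_row_shard` are hypothetical, the writer handles only F32 tensors, and the reader assumes the rows divide evenly across ranks.

```python
import json
import os
import struct
import tempfile


def write_safetensors(path, tensors):
    """Write {name: (shape, flat_values)} as F32 tensors in a simplified safetensors layout."""
    header, buffers, offset = {}, [], 0
    for name, (shape, values) in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": list(shape),
            "data_offsets": [offset, offset + len(raw)],
        }
        buffers.append(raw)
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header size
        f.write(header_bytes)
        for raw in buffers:
            f.write(raw)


def read_row_shard(path, name, rank, world_size):
    """Read only the rows of a 2-D F32 tensor that belong to `rank`."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        info = json.loads(f.read(header_len))[name]
        rows, cols = info["shape"]
        rows_per_rank = rows // world_size  # assumes an even split
        start_row = rank * rows_per_rank
        # Offsets in the header are relative to the first byte after the header.
        begin = 8 + header_len + info["data_offsets"][0] + start_row * cols * 4
        f.seek(begin)
        raw = f.read(rows_per_rank * cols * 4)
    flat = struct.unpack(f"<{rows_per_rank * cols}f", raw)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows_per_rank)]


path = os.path.join(tempfile.mkdtemp(), "toy.safetensors")
write_safetensors(path, {"weight": ((4, 2), [float(i) for i in range(8)])})

shard0 = read_row_shard(path, "weight", rank=0, world_size=2)
shard1 = read_row_shard(path, "weight", rank=1, world_size=2)
print(shard0)
print(shard1)
```

Each rank seeks straight to its own byte range and never reads the rest of the tensor. In practice the safetensors library offers this kind of partial read through `safe_open` and `get_slice`, so hand-written parsing like the above is only for illustration.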
You can learn more about safetensors by reading the safetensors documentation.