Inference Endpoints (dedicated) documentation

Custom Router

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Custom Router

The custom router feature lets you fully customize your load balancing strategy. It’s an advanced feature, but gives you very precise control over each routing decision.

This is useful for example when you need:

  • Queue-based routing to avoid wasting newly scaled-up replicas on burst traffic.
  • Latency-aware routing which avoids sending requests to overloaded replicas.
  • Other custom strategies such as sticky sessions or weighted routing.

How it works

The custom router feature lets you deploy your own router that runs next to each replica. When a request comes in, it always goes to the router on the oldest replica, the leader. That router decides which replica should handle the request: it can forward it to another replica over the internal network, or hand it to the replica running right next to it.

                         (oldest replica is the leader)
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” 
                         β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    either  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                     β”Œβ”€β”€β”€β”‚β–Ίβ”‚  Router  │──────┬────►│ Replica 1  β”‚ β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚   β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
    β”‚   Request   β”‚β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”‚β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                          or
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”‚β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” 
                         β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                         β”‚ β”‚  Router  β”‚      └────►│ Replica 2  β”‚ β”‚
                         β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Whenever the set of replicas changes, for example on scale-up, scale-down, or rolling update, the platform sends the router an updated list of replica addresses by calling POST /_custom_router/set-backends. The router uses this to keep its view of the available replicas up to date and route accordingly.

When `customRouter` is enabled, the replicas of your endpoint are automatically allowed to reach each other over a private internal network, so the router can forward a request to any other replica. This network is scoped to your endpoint's replicas only; it **does not** expose them to anything else.

Router contract

Any HTTP server that implements the following two endpoints can be used as a custom router image:

Method Path Purpose
POST /_custom_router/set-backends The current list of replica addresses the router may forward to: {"backends": ["http://<host>:<port>", ...]}
GET /_custom_router/health Health check: return 200 when the router itself is ready to serve traffic

The router listens on the port set by the customRouter.port field (default 3000). Every other request path is treated as a user request to be proxied.

Enabling the custom router

Currently, you can only enable the custom router through the API. On endpoint creation or update, add a top-level customRouter object with these fields:

  • tag (required): the Docker image reference for your router, e.g. your-org/your-router-image:1.0.0.
  • env (optional): a map of environment variables passed to the router.
  • port (optional): the port your router listens on. Defaults to 3000.

For example:

curl -X POST "https://api.endpoints.huggingface.cloud/v2/endpoint/$NAMESPACE" \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
      "name": "my-endpoint",
      "type": "private",
      "provider": { "vendor": "aws", "region": "us-east-1" },
      "compute": {
        "accelerator": "gpu",
        "instanceType": "nvidia-l40s",
        "instanceSize": "x1",
        "scaling": {
          "minReplica": 0,
          "maxReplica": 4,
          "scaleToZeroTimeout": 15,
          "measure": {"pendingRequests": 2}
        }
      },
      "model": {
        "repository": "black-forest-labs/FLUX.1-schnell",
        "task": "text-to-image",
        "framework": "pytorch",
        "image": { "huggingface": {} }
      },
      "customRouter": {
        "tag": "ghcr.io/huggingface/endpoints-custom-routers/queued-least-latency:1.0.0",
        "env": {
          "MY_VAR": "value"
        }
      }
    }'

To remove the custom router on update, send customRouter with a null or absent tag:

{ "customRouter": {} }
When `customRouter` is set, the `loadBalancer` field in `experimentalFeatures` is ignored.

Reference implementation: queued-least-latency

When an endpoint scales up under load, new replicas take time to start and load the model. Meanwhile requests pile up, and with naive routing they keep queuing on the busy replicas instead of moving to the new ones once they’re ready. This hurts most for slow, non-batching workloads like image generation, where each replica handles one request at a time.

The endpoints-custom-routers repository provides a ready-to-use router, queued-least-latency, that fixes this: it queues incoming requests and sends each to the least-loaded replica, so new capacity is used as soon as it’s ready.

The docker image for the queued-least-latency is: ghcr.io/huggingface/endpoints-custom-routers/queued-least-latency:1.0.0

Routing strategy

  • Incoming requests are pushed onto an in-memory FIFO queue.
  • A dispatcher picks the replica with the lowest EWMA latency that is still under a configurable threshold.
  • Replicas that have never been tried are treated as latency 0 and picked first, so new capacity is used immediately on scale-up.
  • If all replicas are above the threshold, the dispatcher holds the request in the queue until a replica becomes available or the per-request timeout elapses.
  • After each response, the measured end-to-end latency (time to last byte) is fed back into the EWMA for that replica.

Configuration

All settings are environment variables, passed via the env field in the customRouter config.

Routing

Routing decides which replica each request goes to. The knob you’ll usually tune is CUSTOM_ROUTER_LATENCY_THRESHOLD.

Variable Default What it does
CUSTOM_ROUTER_LATENCY_THRESHOLD 3.0 A replica is treated as β€œloaded” once its average latency exceeds this many seconds; new requests stop going to it until it recovers.
CUSTOM_ROUTER_EWMA_ALPHA 0.3 How quickly that latency average reacts to recent requests (0 to 1). Higher is more reactive, lower is smoother. Rarely needs changing.

The latency threshold determines when a replica is considered loaded and new requests should queue at the router instead. The right value depends on whether your model benefits from batching:

  • If batching does not increase your throughput (or if you prioritize latency over throughput), set a low threshold. Once a replica’s EWMA exceeds it, the router stops sending more requests there and queues them instead. Each replica handles one request at a time, which is exactly what you want when batching brings no benefit.
  • If batching does increase your throughput (e.g. LLMs), you may want a higher threshold so that multiple requests can accumulate on a replica and be processed together. This trades some per-request latency for better overall throughput.

For a diffusion / image generation model (e.g. ~40 s per request, no batching benefit), the default threshold (3.0 s) works well. After the first completed request the EWMA is ~40 s, which immediately exceeds 3 s and causes new arrivals to queue at the router. Each replica handles one request at a time, which is fine since batching would not help throughput. Only tune queue depth and timeout:

"customRouter": {
  "tag": "ghcr.io/huggingface/endpoints-custom-routers/queued-least-latency:1.0.0`",
  "env": {
    "CUSTOM_ROUTER_QUEUE_MAX_SIZE": "200",
    "CUSTOM_ROUTER_QUEUE_TIMEOUT": "300"
  }
}

For an LLM (batching increases throughput), replicas can process several concurrent requests, but at a latency cost. With the default 3.0 s threshold, a replica would be flagged as loaded after its very first request and forced to handle requests one at a time. Raise the threshold to let replicas accumulate concurrent requests; how high depends on how much latency you are willing to trade for throughput gains:

"customRouter": {
  "tag": "ghcr.io/huggingface/endpoints-custom-routers/queued-least-latency:1.0.0`",
  "env": {
    "CUSTOM_ROUTER_LATENCY_THRESHOLD": "65.0",
    "CUSTOM_ROUTER_QUEUE_MAX_SIZE": "2000"
  }
}

Queue

The queue configuration decides what happens when every replica is loaded. These are the router’s backpressure controls: when the queue is full, the oldest waiting request is dropped with 503 Service Unavailable, and any request that waits longer than the timeout is also dropped with 503.

Variable Default What it does
CUSTOM_ROUTER_QUEUE_MAX_SIZE 1000 Maximum requests held in the queue; beyond this the oldest waiting request is dropped with 503.
CUSTOM_ROUTER_QUEUE_TIMEOUT 1200 How long (seconds) a request may wait in the queue before being dropped with 503.

Operational

Settings that don’t affect routing.

Variable Default What it does
CUSTOM_ROUTER_PORT 3000 Port the router listens on. Must match the customRouter.port you set in the API.
CUSTOM_ROUTER_STATE_LOG_INTERVAL 30 Seconds between log lines reporting per-replica state.

Recommended values

Use this table as a quick reference for tuning queued-least-latency to your needs:

Scenario Recommended approach
Burst traffic + autoscaling Use queued-least-latency: queued requests drain to new replicas on scale-up
Heterogeneous replica performance Use queued-least-latency: EWMA routing avoids slow replicas
Diffusion / no batching benefit Keep CUSTOM_ROUTER_LATENCY_THRESHOLD at default (3.0 s): one request at a time per replica
LLM / batching increases throughput Raise CUSTOM_ROUTER_LATENCY_THRESHOLD to allow concurrent requests per replica

Metrics

The reference implementation exposes Prometheus metrics at GET /_custom_router/metrics:

Metric Type Description
custom_router_queue_depth Gauge Requests currently waiting in the queue
custom_router_backend_ewma_latency_seconds Gauge EWMA latency per replica (addr label)
custom_router_backend_inflight_requests Gauge In-flight requests per replica
custom_router_requests_dispatched_total Counter Requests successfully forwarded
custom_router_requests_evicted_total Counter Requests dropped due to full queue
custom_router_requests_timeout_total Counter Requests dropped due to queue timeout

GET /_custom_router/health also returns a JSON snapshot of current queue depth and per-replica EWMA stats, useful for debugging.

Building your own router

queued-least-latency is a starting point, not a requirement. If its routing strategy doesn’t fit your use case (say you need KV-cache-aware routing, proactive scale-up signals, or sticky sessions), you can build your own. Any image that satisfies the Router contract can be dropped in as the customRouter image. A minimal skeleton in Go:

// POST /_custom_router/set-backends
// Called by the platform every time the set of replicas changes.
func handleSetBackends(w http.ResponseWriter, r *http.Request) {
    var payload struct {
        Backends []string `json:"backends"` // e.g. ["http://10.0.0.1:8080", ...]
    }
    json.NewDecoder(r.Body).Decode(&payload)
    updateMyBackendList(payload.Backends)
    json.NewEncoder(w).Encode(map[string]bool{"ok": true})
}

// GET /_custom_router/health
// Return 200 when ready to serve traffic.
func handleHealth(w http.ResponseWriter, r *http.Request) {
    json.NewEncoder(w).Encode(map[string]bool{"ok": true})
}

// Everything else is a user request to forward.
func handleProxy(w http.ResponseWriter, r *http.Request) {
    backend := pickBackend() // your routing logic here
    forwardTo(w, r, backend)
}

The full queued-least-latency source in the endpoints-custom-routers repository is a good template to fork.

Update on GitHub