Rate limits

No hard per-key limits by default. Best-effort fair-use throttling, plus honest pass-through of upstream 429s with a real Retry-After.

The design philosophy: your money is the rate limit. If your balance is positive and your key is under its cost cap, we try to serve every request. When we can't, because Anthropic, OpenAI, Google, or another upstream vendor is throttling us, we return 429 with the upstream's real Retry-After header, not a made-up number.

Default per-key limits

  • No hard RPS cap. New keys are not capped at an arbitrary requests-per-second value. Sustained traffic just works.
  • No token-per-minute cap. If you want to burn through a million tokens per minute on Opus, that's fine as long as the balance covers it and upstream capacity is available.
  • Fair-use ceiling ≈ 500 RPS sustained. The gateway begins to shed load if a single key stays above ~500 requests/sec for more than a minute. This is a soft ceiling; email sales@nimbusapi.net to lift it in advance of a spike.
  • Concurrency: up to 1,024 in-flight requests per key. Above that you'll see 429 with code=concurrency_exceeded.
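
To stay under the in-flight ceiling, a client can gate its own requests before they reach the gateway. The following is a minimal sketch, not part of any official SDK: the class name `InFlightLimiter` and its methods are hypothetical, and only the 1,024 figure comes from the list above.

```typescript
// Hypothetical client-side limiter; only the 1,024 ceiling comes from the docs.
class InFlightLimiter {
  private active = 0;
  private readonly waiters: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  /** Number of tasks currently holding a slot. */
  get inFlight(): number {
    return this.active;
  }

  /** Runs `task` once a slot is free and releases the slot when it settles. */
  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Queue until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) next(); // slot is transferred, so `active` stays the same
      else this.active--;
    }
  }
}

// Leave some headroom below the documented per-key ceiling of 1,024.
const gatewayLimiter = new InFlightLimiter(1000);
```

Wrapping each request as `gatewayLimiter.run(() => fetch(...))` keeps the key below the ceiling, provided this process is the only one using that key.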

Bursting rules

The gateway uses a per-key leaky bucket with a generous burst allowance:

  • Burst bucket: 2,000 requests. If the bucket has capacity, the request goes through immediately.
  • Refill rate: 500 requests/sec.
  • Sustained throughput: 500 RPS indefinitely.
  • Burst duration: ~4 seconds of free bursting at 1,000 RPS from a full bucket before throttling kicks in. At 1,000 RPS the bucket drains at a net 1,000 - 500 = 500 requests/sec, and 2,000 / 500 = 4 s.
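
The numbers above can be checked with a small simulation. This sketch models the limiter as a token bucket, one common way to get the behavior described. The gateway's actual implementation is not specified here, and the names `TokenBucket` and `secondsUntilThrottled` are illustrative assumptions.

```typescript
// Illustrative model only; capacity and refill rate are taken from the list above.
class TokenBucket {
  private tokens: number;
  private lastMs: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // start with a full bucket
    this.lastMs = nowMs;
  }

  /** Returns true if a request arriving at `nowMs` is admitted. */
  tryAcquire(nowMs: number): boolean {
    const elapsedSec = (nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

/** Simulates a steady stream and returns the time of the first rejection, in seconds. */
function secondsUntilThrottled(offeredRps: number, limitSec = 60): number {
  const bucket = new TokenBucket(2000, 500, 0);
  const stepMs = 1000 / offeredRps;
  for (let i = 0; i * stepMs < limitSec * 1000; i++) {
    if (!bucket.tryAcquire(i * stepMs)) return (i * stepMs) / 1000;
  }
  return Infinity; // never throttled within the simulated window
}
```

Under this model, a steady 1,000 RPS stream sees its first rejection after roughly 4 seconds, while a 500 RPS stream is never rejected within the simulated window.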

When you'll see 429

  1. Upstream throttled us. Anthropic or OpenAI returned 429 because our aggregate is over their per-org limit. This is the common case during viral moments. We pass through their Retry-After.
  2. Upstream 5xx cascade. The vendor is having an outage. We retry once internally with jitter, then fall through to your configured fallback model in the same class if you registered one; otherwise we return 429.
  3. Fair-use burst exceeded. code=rate_limit_exceeded. Slow down or contact us to lift the soft cap.
  4. Concurrency exceeded. code=concurrency_exceeded. More than 1,024 requests are in flight for the same key.
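
Because the four cases call for different client reactions, it can help to branch on the error code. The sketch below uses only the codes that appear on this page (`upstream_throttled`, `rate_limit_exceeded`, `concurrency_exceeded`). The code returned for the upstream 5xx case is not documented here, so unknown codes fall back to the default. The `RetryAdvice` labels and the function name are hypothetical.

```typescript
// Hypothetical labels describing what the client should do next.
type RetryAdvice = "wait_retry_after" | "slow_down" | "reduce_concurrency";

/** Maps the `error.code` field of a 429 body to a suggested client reaction. */
function adviceFor429(code: string | undefined): RetryAdvice {
  switch (code) {
    case "rate_limit_exceeded":
      // Fair-use ceiling hit: lower the request rate, not just retry.
      return "slow_down";
    case "concurrency_exceeded":
      // Too many in-flight requests: cap parallelism on the client.
      return "reduce_concurrency";
    case "upstream_throttled":
    default:
      // Upstream throttling, or an undocumented code: honor Retry-After.
      return "wait_retry_after";
  }
}
```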

Retry-After header format

Every 429 response includes an HTTP-compliant Retry-After header. It is always a decimal number of seconds (never an HTTP-date). We also mirror the upstream's rate-limit headers for observability:

http
HTTP/1.1 429 Too Many Requests
retry-after: 3
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 2026-07-01T14:32:21Z
content-type: application/json

{
  "error": {
    "type": "rate_limit_error",
    "code": "upstream_throttled",
    "message": "Upstream provider (anthropic) returned 429. Retry in 3s.",
    "upstream_provider": "anthropic",
    "upstream_status": 429,
    "retry_after_seconds": 3
  }
}
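
A client might extract the delay from a response like the one above as follows. This is a sketch under assumptions: the `RateLimitErrorBody` shape is inferred from this single sample rather than from a published schema, and the function name `retryDelayMs` is hypothetical. It prefers the header and falls back to `retry_after_seconds` in the body.

```typescript
// Shape inferred from the sample response above; every field is treated as optional.
interface RateLimitErrorBody {
  error?: {
    type?: string;
    code?: string;
    message?: string;
    upstream_provider?: string;
    upstream_status?: number;
    retry_after_seconds?: number;
  };
}

/** Returns the delay in milliseconds, or null if neither source has a usable value. */
function retryDelayMs(retryAfterHeader: string | null, bodyText: string): number | null {
  // The header is documented as a number of seconds, never an HTTP-date.
  if (retryAfterHeader !== null && retryAfterHeader.trim() !== "") {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  }
  try {
    const parsed = JSON.parse(bodyText) as RateLimitErrorBody | null;
    const seconds = parsed?.error?.retry_after_seconds;
    if (typeof seconds === "number" && Number.isFinite(seconds) && seconds >= 0) {
      return seconds * 1000;
    }
  } catch {
    // Body was not valid JSON; fall through to null.
  }
  return null;
}
```

With a `fetch` response this would be called as `retryDelayMs(res.headers.get("retry-after"), await res.text())`.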

Recommended retry pattern

Respect Retry-After if it's present, fall back to exponential backoff with jitter otherwise, and cap total attempts:

ts
async function callWithRetry(
  endpoint: string,
  body: unknown,
  maxAttempts = 5,
): Promise<Response> {
  let attempt = 0;
  while (true) {
    const res = await fetch(endpoint, {
      method: "POST",
      headers: {
        "content-type": "application/json",
        authorization: `Bearer ${process.env.NIMBUS_API_KEY}`,
      },
      body: JSON.stringify(body),
    });

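    // Return anything other than a 429, or the last 429 once attempts run out.
    if (res.status !== 429 || attempt + 1 >= maxAttempts) return res;

    // Honor Retry-After when present; otherwise use exponential backoff.
    const header = res.headers.get("retry-after");
    const retryAfter = header === null ? NaN : Number(header);
    const waitMs = Number.isFinite(retryAfter)
      ? retryAfter * 1000
      : 2 ** attempt * 500;
    // Jitter keeps many clients from retrying in lockstep.
    await new Promise((r) => setTimeout(r, waitMs + Math.random() * 250));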
    attempt++;
  }
}

Multi-provider fallback

If you register a fallback chain on your key (in /dashboard/keys), a 429 or 5xx from the primary model auto-falls through to the next model in the chain and only returns 429 to you when the whole chain is exhausted. Popular chains:

  • Opus fallback: claude-opus-4.8 → claude-opus-4.7 → gpt-5.5
  • Coding fallback: gpt-5.3-codex → deepseek-v4-pro → qwen3-coder
  • Cheap fallback: gemini-3-flash-preview → gpt-5.4-mini
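
The gateway performs this fall-through for you once a chain is registered. The sketch below only illustrates the semantics described above (advance on 429 or 5xx, return 429 once the chain is exhausted) and is not the gateway's code. The names `callWithFallback` and `shouldFallThrough` and the injected `callModel` function are hypothetical.

```typescript
interface Reply {
  status: number;
}

/** A 429 or any 5xx from one model moves on to the next model in the chain. */
function shouldFallThrough(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

/**
 * Tries each model in order. Returns the first reply that is not a 429/5xx,
 * or a synthetic 429 marker when every model in the chain fell through.
 */
async function callWithFallback<R extends Reply>(
  chain: readonly string[],
  callModel: (model: string) => Promise<R>,
): Promise<{ model: string; reply: R } | { model: null; status: 429 }> {
  for (const model of chain) {
    const reply = await callModel(model);
    if (!shouldFallThrough(reply.status)) return { model, reply };
  }
  return { model: null, status: 429 };
}
```

Note that under this reading a 4xx other than 429 (for example a malformed request) is returned immediately rather than retried on the next model.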

Higher limits

If you plan to sustain more than 500 RPS or need pre-allocated capacity on a specific model during a launch, email sales@nimbusapi.net with your expected RPS, TPM, and the model. We can pre-reserve upstream capacity with 48 hours' notice for accounts over $5k/mo.

See also Cost caps for the per-key budget ceiling (separate from RPS limits).