Model Fallback


Model fallback lets you define an ordered list of models to try in sequence. If the primary model is unavailable — due to a provider outage, rate limit, or capacity constraint — SolRouter automatically retries the request with the next model in the chain, transparently to your application.

Your code receives a successful response regardless of which model ultimately served it. The model field in the response tells you which one was used.


How fallback works

When you include a models array and "route": "fallback" in your request body, SolRouter follows this process:

  1. Attempt the request with the first model in the models array
  2. If that model returns a 5xx error, a timeout, or a capacity/rate-limit error, move to the next model
  3. Repeat until a model succeeds or the list is exhausted
  4. If all models fail, return the last error to your application

The entire retry chain happens server-side. There is no additional latency from round-trips between your application and SolRouter — only the latency of the provider calls themselves.

Your app
   │
   ▼
SolRouter receives request with models: [A, B, C] + route: "fallback"
   │
   ├─ Try model A ──► Provider A unavailable (503)
   │                        │
   ├─ Try model B ──────────┘ ──► Provider B rate limited (429)
   │                                     │
   └─ Try model C ──────────────────────┘ ──► Success ✓
         │
         ▼
   Response returned to your app
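The retry chain in the diagram can be sketched as a simple loop. This is a hypothetical client-side equivalent for illustration, not SolRouter's actual implementation — `try_model` is a stand-in for a real provider call:

```python
# Failure kinds that advance the chain, mirroring the rules above:
# 5xx errors, timeouts, and capacity/rate-limit errors.
RETRYABLE = {"server_error", "timeout", "rate_limited", "capacity"}

def route_with_fallback(models, try_model):
    """Try each model in order and return (model, result) for the first success.

    If every model fails, re-raise the error from the last attempt,
    matching the behaviour described in step 4 above.
    """
    last_error = None
    for model in models:
        try:
            return model, try_model(model)
        except RuntimeError as err:
            # Non-retryable errors surface immediately instead of
            # advancing the chain.
            if str(err) not in RETRYABLE:
                raise
            last_error = err
    if last_error is None:
        raise ValueError("no models provided")
    raise last_error
```

The loop only distinguishes retryable from non-retryable failures; everything else (timeout budgets, per-provider backoff) happens inside the provider call itself.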

Configuring fallback

Add two fields to your standard chat completions request body:

  • models — an ordered array of model IDs to try, from highest to lowest preference
  • route — set to "fallback" to enable the failover behaviour

The top-level model field is still required for compatibility with OpenAI SDK clients. SolRouter treats it as the implicit first entry — the models array represents the fallback sequence that follows if the primary model fails.

Minimal example (raw JSON)

{
  "model": "openai/gpt-4o",
  "models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
  "route": "fallback",
  "messages": [
    { "role": "user", "content": "Summarise this document for me." }
  ]
}

In this request:

  1. SolRouter first tries openai/gpt-4o
  2. If that fails, it tries anthropic/claude-sonnet-4
  3. If that also fails, it tries google/gemini-2.5-flash
  4. If all three fail, the error from the last attempt is returned

Extended fallback chain

You can include as many models as you need. Longer chains provide more resilience at the cost of potentially higher latency in degraded scenarios.

{
  "model": "openai/gpt-4.1",
  "models": [
    "anthropic/claude-opus-4",
    "google/gemini-2.5-pro",
    "openai/gpt-4o",
    "anthropic/claude-sonnet-4",
    "google/gemini-2.5-flash",
    "meta-llama/llama-4-maverick"
  ],
  "route": "fallback",
  "messages": [
    { "role": "user", "content": "Analyse the attached quarterly report." }
  ]
}

TypeScript examples

Using the OpenAI SDK with extra body fields

The OpenAI SDK passes unknown body fields through to the API untouched, making it easy to add models and route without any custom logic.

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.solrouter.io/ai",
  apiKey: process.env.SOLROUTER_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o",
  models: ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
  route: "fallback",
  messages: [{ role: "user", content: "What are the key risks in this contract?" }],
} as any);

// Check which model actually served the request
console.log("Served by:", completion.model);
console.log(completion.choices[0].message.content);

Using fetch directly

const response = await fetch("https://api.solrouter.io/ai/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SOLROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-4o",
    models: ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
    route: "fallback",
    messages: [{ role: "user", content: "What are the key risks in this contract?" }],
  }),
});

const data = await response.json();

// data.model tells you which model in the chain was used
console.log("Served by:", data.model);
console.log(data.choices[0].message.content);

With streaming

Fallback works transparently with streaming. SolRouter begins streaming from the first model that accepts the request — if a model fails before streaming starts, the fallback is tried before any tokens are sent to your client.

const stream = await client.chat.completions.create({
  model: "openai/gpt-4o",
  models: ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
  route: "fallback",
  messages: [
    {
      role: "system",
      content: "You are a helpful assistant. Be concise.",
    },
    {
      role: "user",
      content: "Explain the CAP theorem in plain English.",
    },
  ],
  stream: true,
} as any);

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}

Reusable helper function

import OpenAI from "openai";

interface FallbackCompletionOptions
  extends Omit<OpenAI.ChatCompletionCreateParamsNonStreaming, "model"> {
  model: string;
  models?: string[];
}

async function createWithFallback(
  client: OpenAI,
  options: FallbackCompletionOptions
): Promise<OpenAI.ChatCompletion> {
  return client.chat.completions.create({
    ...options,
    route: "fallback",
  } as any);
}

// Usage
const client = new OpenAI({
  baseURL: "https://api.solrouter.io/ai",
  apiKey: process.env.SOLROUTER_API_KEY,
});

const completion = await createWithFallback(client, {
  model: "openai/gpt-4o",
  models: ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
  messages: [{ role: "user", content: "Hello!" }],
});

Python examples

Using the OpenAI SDK

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.solrouter.io/ai",
    api_key=os.environ["SOLROUTER_API_KEY"],
)

completion = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "What are the key risks in this contract?"}],
    extra_body={
        "models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
        "route": "fallback",
    },
)

# Check which model served the request
print("Served by:", completion.model)
print(completion.choices[0].message.content)

Using httpx directly

import httpx
import os

response = httpx.post(
    "https://api.solrouter.io/ai/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['SOLROUTER_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "openai/gpt-4o",
        "models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
        "route": "fallback",
        "messages": [{"role": "user", "content": "What are the key risks in this contract?"}],
    },
    timeout=60.0,
)

data = response.json()
print("Served by:", data["model"])
print(data["choices"][0]["message"]["content"])

Streaming with fallback (Python)

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.solrouter.io/ai",
    api_key=os.environ["SOLROUTER_API_KEY"],
)

stream = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Explain the CAP theorem in plain English."}],
    stream=True,
    extra_body={
        "models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
        "route": "fallback",
    },
)

for chunk in stream:
    # Some providers emit a final chunk with an empty choices list
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Reusable wrapper class (Python)

from openai import OpenAI
from typing import Any
import os


class FallbackClient:
    """OpenAI-compatible client with automatic model fallback."""

    def __init__(self, api_key: str, fallback_models: list[str]):
        self._client = OpenAI(
            base_url="https://api.solrouter.io/ai",
            api_key=api_key,
        )
        self._fallback_models = fallback_models

    def chat(self, primary_model: str, **kwargs: Any):
        return self._client.chat.completions.create(
            model=primary_model,
            extra_body={
                "models": self._fallback_models,
                "route": "fallback",
            },
            **kwargs,
        )


# Usage
fallback_client = FallbackClient(
    api_key=os.environ["SOLROUTER_API_KEY"],
    fallback_models=["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
)

completion = fallback_client.chat(
    primary_model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Summarise the latest AI research trends."}],
)

print(completion.choices[0].message.content)

Use cases

High availability

Provider outages happen. Even major providers like OpenAI and Anthropic experience degraded availability from time to time. A fallback chain keeps your application online during those windows without requiring you to write retry logic yourself.

Recommended chain for maximum uptime:

{
  "model": "openai/gpt-4o",
  "models": [
    "anthropic/claude-sonnet-4",
    "google/gemini-2.5-flash",
    "meta-llama/llama-4-maverick"
  ],
  "route": "fallback"
}

This chain spans four different providers — OpenAI, Anthropic, Google, and Meta — so a single provider outage is fully absorbed.

Cost optimisation

Route expensive requests through a premium model first, but fall back to a cheaper model for tasks that do not actually require the premium one. This works well when you have a mixed workload where some requests are complex and others are simple.

Premium-first with budget fallback:

{
  "model": "anthropic/claude-opus-4",
  "models": [
    "anthropic/claude-sonnet-4",
    "openai/gpt-4o-mini"
  ],
  "route": "fallback"
}

In practice this strategy is most useful when the premium model is rate-limited — once you hit its rate limit, cheaper models handle the overflow automatically.

Graceful degradation for free-tier models

If your application uses free models as the primary choice to minimise cost, fall back to a paid model when the free tier is at capacity or unavailable:

{
  "model": "meta-llama/llama-3.1-8b-instruct:free",
  "models": [
    "meta-llama/llama-3.3-70b-instruct",
    "openai/gpt-4o-mini"
  ],
  "route": "fallback"
}

This gives you zero cost when the free tier is available and seamless continuation when it is not.

Reasoning model with standard fallback

Reasoning models can occasionally be unavailable or under heavy load. Pair them with a capable standard model so your application does not stall waiting for a reasoning model to come back online:

{
  "model": "openai/o3",
  "models": [
    "openai/o4-mini",
    "anthropic/claude-3.7-sonnet:thinking",
    "openai/gpt-4o"
  ],
  "route": "fallback"
}

Fallback vs load balancing

SolRouter supports two routing strategies: "fallback" and "load-balance". They serve different purposes.

Feature              | "fallback"                              | "load-balance"
---------------------|-----------------------------------------|-----------------------------------------------
Trigger              | Only when primary model fails           | Every request
Model selection      | Ordered — try first, then second, etc.  | Random or round-robin across all listed models
Primary use case     | High availability, resilience           | Even traffic distribution, cost averaging
Response model       | First model that succeeds               | Any model in the list
Latency              | No overhead if primary succeeds         | No overhead — selection is instantaneous
Cost predictability  | High — primary model used most of the time | Lower — costs vary by which model is selected

When to use "fallback": You have a preferred model and want automatic recovery from failures.

When to use "load-balance": You want to spread load across multiple models evenly, or you are intentionally mixing models to average out their strengths and costs.

To use load balancing, set "route": "load-balance" and list all the models you want traffic distributed across in the models array.
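As a sketch, a load-balanced request body might look like this — the same schema as the fallback examples, with only the route value changed (whether the top-level model also participates in the distribution is an assumption; check the load-balancing documentation):

```json
{
  "model": "openai/gpt-4o",
  "models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
  "route": "load-balance",
  "messages": [
    { "role": "user", "content": "Hello!" }
  ]
}
```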


Identifying which model responded

The model field in the response body always reflects the model that actually generated the reply. When fallback is triggered, this will differ from the model field you sent in the request.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "anthropic/claude-sonnet-4",
  "choices": [ ... ],
  "usage": {
    "prompt_tokens": 154,
    "completion_tokens": 312,
    "total_tokens": 466,
    "cost": 0.0000054
  }
}

You can log completion.model to track which provider handled each request in production. This is useful for monitoring provider reliability and understanding the real distribution of traffic across your fallback chain.

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o",
  models: ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
  route: "fallback",
  messages: [{ role: "user", content: "Hello!" }],
} as any);

if (completion.model !== "openai/gpt-4o") {
  console.warn(`Primary model unavailable — served by fallback: ${completion.model}`);
}

Things to keep in mind

Models in the fallback chain must support the same features you use. If your request includes vision (image inputs), every model in the chain must support image input. If a fallback model does not support an input type, it will return an error and the next model will be tried.
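One way to guard against this is to filter a chain against a capability map before sending the request. A minimal sketch — the capability values below are illustrative only, not real model data; check the model catalogue for actual support:

```python
# Hypothetical capability map. Real vision support varies by model
# and provider — these values are placeholders for illustration.
MODEL_SUPPORTS_VISION = {
    "openai/gpt-4o": True,
    "anthropic/claude-sonnet-4": True,
    "google/gemini-2.5-flash": True,
    "meta-llama/llama-4-maverick": False,  # illustrative value only
}

def vision_safe_chain(models: list[str]) -> list[str]:
    """Keep only models known to accept image inputs.

    Unknown models are excluded rather than assumed compatible.
    """
    return [m for m in models if MODEL_SUPPORTS_VISION.get(m, False)]
```

Filtering up front avoids burning a fallback attempt on a model that is guaranteed to reject the request.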

System prompt and message format compatibility. All major providers support the standard system / user / assistant message format. However, some provider-specific extensions (e.g. Anthropic's cache_control annotations on message content) are not available on other providers' models. Keep provider-specific parameters out of requests that use cross-provider fallback chains.

Token limits vary by model. If your prompt is long and a fallback model has a smaller context window, the request to that model will fail with a context length error and the next model in the chain will be tried.

Cost is determined by the model that served the request. The usage.cost field in the response reflects the pricing of the model that actually responded, which may differ from your primary model's pricing.
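A minimal sketch of per-request cost logging built on that behaviour, assuming the parsed response shape shown in "Identifying which model responded" (usage.cost is a SolRouter-specific field):

```python
def log_serving_cost(response: dict) -> tuple[str, float]:
    """Extract the serving model and its cost from a parsed response body.

    Falls back to 0.0 if the usage.cost field is absent.
    """
    model = response["model"]
    cost = response.get("usage", {}).get("cost", 0.0)
    print(f"Served by {model} at cost {cost}")
    return model, cost
```

Aggregating these pairs over time shows how often fallbacks fire and what they actually cost relative to the primary model.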


Next steps

  • Available Models — browse the full catalogue to build well-matched fallback chains
  • Reasoning Models — pair reasoning models with standard fallback models
  • Token Counting — estimate costs across the models in your fallback chain
  • Streaming — SSE streaming with fallback and how buffering works