Model Fallback
Model fallback lets you define an ordered list of models to try in sequence. If the primary model is unavailable — due to a provider outage, rate limit, or capacity constraint — SolRouter automatically retries the request with the next model in the chain, completely transparently to your application.
Your code receives a successful response regardless of which model ultimately served it. The model field in the response tells you which one was used.
How fallback works
When you include a models array and "route": "fallback" in your request body, SolRouter follows this process:
- Attempt the request with the first model in the models array
- If that model returns a 5xx error, a timeout, or a capacity/rate-limit error, move to the next model
- Repeat until a model succeeds or the list is exhausted
- If all models fail, return the last error to your application
The entire retry chain happens server-side. There is no additional latency from round-trips between your application and SolRouter — only the latency of the provider calls themselves.
Your app
│
▼
SolRouter receives request with models: [A, B, C] + route: "fallback"
│
├─ Try model A ──► Provider A unavailable (503)
│ │
├─ Try model B ──────────┘ ──► Provider B rate limited (429)
│ │
└─ Try model C ──────────────────────┘ ──► Success ✓
│
▼
Response returned to your app
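The chain in the diagram amounts to an ordered retry loop. SolRouter runs this loop server-side, but a minimal client-side sketch of the same behaviour (hypothetical names, for illustration only) makes the semantics concrete:

```python
class Unavailable(Exception):
    """Stand-in for the retryable failures the router handles (5xx, 429, timeout)."""


def call_with_fallback(models, call):
    """Try each model in order; return (model, result) for the first success.

    If every model fails, re-raise the last error, mirroring the documented
    behaviour of returning the final attempt's error.
    """
    last_err = None
    for model in models:
        try:
            return model, call(model)
        except Unavailable as err:
            last_err = err  # retryable failure: move on to the next model
    raise last_err
```

Note that non-retryable failures (e.g. an invalid request) would not advance the chain; only availability-class errors do.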
Configuring fallback
Add two fields to your standard chat completions request body:
- models — an ordered array of model IDs to try, from highest to lowest preference
- route — set to "fallback" to enable the failover behaviour
The top-level model field is still required for compatibility with OpenAI SDK clients. SolRouter treats it as the implicit first entry — the models array represents the fallback sequence that follows if the primary model fails.
Minimal example (raw JSON)
{
"model": "openai/gpt-4o",
"models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
"route": "fallback",
"messages": [
{ "role": "user", "content": "Summarise this document for me." }
]
}
In this request:
- SolRouter first tries openai/gpt-4o
- If that fails, it tries anthropic/claude-sonnet-4
- If that also fails, it tries google/gemini-2.5-flash
- If all three fail, the error from the last attempt is returned
Extended fallback chain
You can include as many models as you need. Longer chains provide more resilience at the cost of potentially higher latency in degraded scenarios.
{
"model": "openai/gpt-4.1",
"models": [
"anthropic/claude-opus-4",
"google/gemini-2.5-pro",
"openai/gpt-4o",
"anthropic/claude-sonnet-4",
"google/gemini-2.5-flash",
"meta-llama/llama-4-maverick"
],
"route": "fallback",
"messages": [
{ "role": "user", "content": "Analyse the attached quarterly report." }
]
}
TypeScript examples
Using the OpenAI SDK with extra body fields
The OpenAI SDK passes unknown body fields through to the API untouched, making it easy to add models and route without any custom logic.
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.solrouter.io/ai",
apiKey: process.env.SOLROUTER_API_KEY,
});
const completion = await client.chat.completions.create({
model: "openai/gpt-4o",
models: ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
route: "fallback",
messages: [{ role: "user", content: "What are the key risks in this contract?" }],
} as any);
// Check which model actually served the request
console.log("Served by:", completion.model);
console.log(completion.choices[0].message.content);
Using fetch directly
const response = await fetch("https://api.solrouter.io/ai/chat/completions", {
method: "POST",
headers: {
"Authorization": `Bearer ${process.env.SOLROUTER_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "openai/gpt-4o",
models: ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
route: "fallback",
messages: [{ role: "user", content: "What are the key risks in this contract?" }],
}),
});
const data = await response.json();
// data.model tells you which model in the chain was used
console.log("Served by:", data.model);
console.log(data.choices[0].message.content);
With streaming
Fallback works transparently with streaming. SolRouter begins streaming from the first model that accepts the request — if a model fails before streaming starts, the fallback is tried before any tokens are sent to your client.
const stream = await client.chat.completions.create({
model: "openai/gpt-4o",
models: ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
route: "fallback",
messages: [
{
role: "system",
content: "You are a helpful assistant. Be concise.",
},
{
role: "user",
content: "Explain the CAP theorem in plain English.",
},
],
stream: true,
} as any);
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content;
if (delta) process.stdout.write(delta);
}
Reusable helper function
import OpenAI from "openai";
interface FallbackCompletionOptions
extends Omit<OpenAI.ChatCompletionCreateParamsNonStreaming, "model"> {
model: string;
models?: string[];
}
async function createWithFallback(
client: OpenAI,
options: FallbackCompletionOptions
): Promise<OpenAI.ChatCompletion> {
return client.chat.completions.create({
...options,
route: "fallback",
} as any);
}
// Usage
const client = new OpenAI({
baseURL: "https://api.solrouter.io/ai",
apiKey: process.env.SOLROUTER_API_KEY,
});
const completion = await createWithFallback(client, {
model: "openai/gpt-4o",
models: ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
messages: [{ role: "user", content: "Hello!" }],
});
Python examples
Using the OpenAI SDK
from openai import OpenAI
import os
client = OpenAI(
base_url="https://api.solrouter.io/ai",
api_key=os.environ["SOLROUTER_API_KEY"],
)
completion = client.chat.completions.create(
model="openai/gpt-4o",
messages=[{"role": "user", "content": "What are the key risks in this contract?"}],
extra_body={
"models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
"route": "fallback",
},
)
# Check which model served the request
print("Served by:", completion.model)
print(completion.choices[0].message.content)
Using httpx directly
import httpx
import os
response = httpx.post(
"https://api.solrouter.io/ai/chat/completions",
headers={
"Authorization": f"Bearer {os.environ['SOLROUTER_API_KEY']}",
"Content-Type": "application/json",
},
json={
"model": "openai/gpt-4o",
"models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
"route": "fallback",
"messages": [{"role": "user", "content": "What are the key risks in this contract?"}],
},
timeout=60.0,
)
data = response.json()
print("Served by:", data["model"])
print(data["choices"][0]["message"]["content"])
Streaming with fallback (Python)
from openai import OpenAI
import os
client = OpenAI(
base_url="https://api.solrouter.io/ai",
api_key=os.environ["SOLROUTER_API_KEY"],
)
stream = client.chat.completions.create(
model="openai/gpt-4o",
messages=[{"role": "user", "content": "Explain the CAP theorem in plain English."}],
stream=True,
extra_body={
"models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
"route": "fallback",
},
)
for chunk in stream:
    if not chunk.choices:  # the final usage chunk can have an empty choices list
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
Reusable wrapper class (Python)
from openai import OpenAI
from typing import Any
import os
class FallbackClient:
    """OpenAI-compatible client with automatic model fallback."""

    def __init__(self, api_key: str, fallback_models: list[str]):
        self._client = OpenAI(
            base_url="https://api.solrouter.io/ai",
            api_key=api_key,
        )
        self._fallback_models = fallback_models

    def chat(self, primary_model: str, **kwargs: Any):
        return self._client.chat.completions.create(
            model=primary_model,
            extra_body={
                "models": self._fallback_models,
                "route": "fallback",
            },
            **kwargs,
        )
# Usage
fallback_client = FallbackClient(
api_key=os.environ["SOLROUTER_API_KEY"],
fallback_models=["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
)
completion = fallback_client.chat(
primary_model="openai/gpt-4o",
messages=[{"role": "user", "content": "Summarise the latest AI research trends."}],
)
print(completion.choices[0].message.content)
Use cases
High availability
Provider outages happen. Even major providers like OpenAI and Anthropic experience degraded availability from time to time. A fallback chain keeps your application online during those windows without requiring you to write retry logic yourself.
Recommended chain for maximum uptime:
{
"model": "openai/gpt-4o",
"models": [
"anthropic/claude-sonnet-4",
"google/gemini-2.5-flash",
"meta-llama/llama-4-maverick"
],
"route": "fallback"
}
This chain spans four different providers (OpenAI, Anthropic, Google, and Meta), so a single provider outage is fully absorbed.
Cost optimisation
Route requests through a premium model first, with cheaper models as backups. Because fallback only triggers on failure, the premium model serves traffic whenever it is available, and the cheaper models absorb requests when it is not. This works well for workloads that can tolerate a quality drop during degraded periods.
Premium-first with budget fallback:
{
"model": "anthropic/claude-opus-4",
"models": [
"anthropic/claude-sonnet-4",
"openai/gpt-4o-mini"
],
"route": "fallback"
}
In practice this strategy is most useful when the premium model is rate-limited — once you hit its rate limit, cheaper models handle the overflow automatically.
Graceful degradation for free-tier models
If your application uses free models as the primary choice to minimise cost, fall back to a paid model when the free tier is at capacity or unavailable:
{
"model": "meta-llama/llama-3.1-8b-instruct:free",
"models": [
"meta-llama/llama-3.3-70b-instruct",
"openai/gpt-4o-mini"
],
"route": "fallback"
}
This gives you zero cost when the free tier is available and seamless continuation when it is not.
Reasoning model with standard fallback
Reasoning models can occasionally be unavailable or under heavy load. Pair them with a capable standard model so your application does not stall waiting for a reasoning model to come back online:
{
"model": "openai/o3",
"models": [
"openai/o4-mini",
"anthropic/claude-3.7-sonnet:thinking",
"openai/gpt-4o"
],
"route": "fallback"
}
Fallback vs load balancing
SolRouter supports two routing strategies: "fallback" and "load-balance". They serve different purposes.
| Feature | "fallback" | "load-balance" |
|---|---|---|
| Trigger | Only when primary model fails | Every request |
| Model selection | Ordered — try first, then second, etc. | Random or round-robin across all listed models |
| Primary use case | High availability, resilience | Even traffic distribution, cost averaging |
| Response model | First model that succeeds | Any model in the list |
| Latency | No overhead if primary succeeds | No overhead — selection is instantaneous |
| Cost predictability | High — primary model used most of the time | Lower — costs vary by which model is selected |
When to use "fallback": You have a preferred model and want automatic recovery from failures.
When to use "load-balance": You want to spread load across multiple models evenly, or you are intentionally mixing models to average out their strengths and costs.
To use load balancing, set "route": "load-balance" and list all the models you want traffic distributed across in the models array.
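As a sketch, a load-balanced request body reuses the same fields with a different route value (model IDs carried over from the earlier examples):

```json
{
  "model": "openai/gpt-4o",
  "models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
  "route": "load-balance",
  "messages": [
    { "role": "user", "content": "Hello!" }
  ]
}
```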
Identifying which model responded
The model field in the response body always reflects the model that actually generated the reply. When fallback is triggered, this will differ from the model field you sent in the request.
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "anthropic/claude-sonnet-4",
"choices": [ ... ],
"usage": {
"prompt_tokens": 154,
"completion_tokens": 312,
"total_tokens": 466,
"cost": 0.0000054
}
}
You can log completion.model to track which provider handled each request in production. This is useful for monitoring provider reliability and understanding the real distribution of traffic across your fallback chain.
const completion = await client.chat.completions.create({
model: "openai/gpt-4o",
models: ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
route: "fallback",
messages: [{ role: "user", content: "Hello!" }],
} as any);
if (completion.model !== "openai/gpt-4o") {
console.warn(`Primary model unavailable — served by fallback: ${completion.model}`);
}
Things to keep in mind
Models in the fallback chain must support the same features you use. If your request includes vision (image inputs), every model in the chain must support image input. If a fallback model does not support an input type, it will return an error and the next model will be tried.
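Before sending a multimodal request through a chain, you can sanity-check feature support yourself. A minimal sketch using a hand-maintained capability map (the map entries below are assumptions for illustration; the real source of truth is the models catalogue):

```python
# Hypothetical capability map; consult the models catalogue for real capabilities.
SUPPORTS_IMAGE_INPUT = {
    "openai/gpt-4o": True,
    "anthropic/claude-sonnet-4": True,
    "google/gemini-2.5-flash": True,
    "meta-llama/llama-3.1-8b-instruct:free": False,
}


def image_safe_chain(models: list[str]) -> list[str]:
    """Keep only models known to accept image inputs; unknown models are dropped."""
    return [m for m in models if SUPPORTS_IMAGE_INPUT.get(m, False)]
```

Filtering the chain up front avoids burning a fallback attempt on a model that will reject the input type outright.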
System prompt and message format compatibility. All major providers support the standard system / user / assistant message format. However, some provider-specific extensions (e.g. Anthropic's cache_control content-block fields) are not available on other providers' models. Keep provider-specific parameters out of requests that use cross-provider fallback chains.
Token limits vary by model. If your prompt is long and a fallback model has a smaller context window, the request to that model will fail with a context length error and the next model in the chain will be tried.
Cost is determined by the model that served the request. The usage.cost field in the response reflects the pricing of the model that actually responded, which may differ from your primary model's pricing.
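Because cost follows the serving model, it is worth aggregating spend per served model rather than assuming primary-model pricing. A minimal sketch over the response shape shown above:

```python
from collections import defaultdict

# Running total of spend, keyed by the model that actually served each request.
cost_by_model: dict[str, float] = defaultdict(float)


def record_usage(response: dict) -> None:
    """Accumulate the response's usage.cost under the serving model's ID."""
    cost_by_model[response["model"]] += response["usage"]["cost"]
```

Feeding every response through a recorder like this shows how much of your spend the fallback chain is actually responsible for.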
Next steps
- Available Models — browse the full catalogue to build well-matched fallback chains
- Reasoning Models — pair reasoning models with standard fallback models
- Token Counting — estimate costs across the models in your fallback chain
- Streaming — SSE streaming with fallback and how buffering works