Reasoning Models


Reasoning models are a class of large language models that spend additional compute "thinking" before producing a final answer. Instead of immediately generating a response token-by-token, these models run an internal chain-of-thought process — working through sub-problems, checking their own logic, and revising intermediate conclusions — before emitting the visible reply.

The result is dramatically better performance on multi-step logic, mathematics, code debugging, scientific analysis, and any problem where the answer cannot be reached by pattern matching alone.


How reasoning works

Standard language models predict the next token given the conversation history. Reasoning models insert a hidden scratchpad phase between reading the prompt and writing the response. During this phase the model:

  1. Decomposes the problem into smaller sub-tasks
  2. Works through each sub-task step by step
  3. Verifies intermediate results and backtracks when it detects an error
  4. Synthesises a final answer from the verified chain of thought

Some providers expose the raw thinking tokens in the response (DeepSeek R1, Claude :thinking), while others keep the scratchpad internal and only surface a summary or the final answer (OpenAI o-series). SolRouter forwards whatever the provider returns, so the structure of the response varies slightly by model.


Available reasoning models

OpenAI o-series

OpenAI's o models are purpose-built reasoning models that replace the standard chat completions path with an extended thinking phase. They do not accept a temperature parameter and use max_completion_tokens (not max_tokens) to bound output length.

| Model ID | Context | Input ($/M) | Output ($/M) | Notes |
| --- | --- | --- | --- | --- |
| openai/o3 | 200k | $10.00 | $40.00 | Flagship reasoning, highest accuracy |
| openai/o4-mini | 200k | $1.10 | $4.40 | Fast reasoning, lower cost |
| openai/o3-mini | 200k | $1.10 | $4.40 | Compact, efficient |

Anthropic Claude — :thinking suffix

Anthropic offers extended thinking on select Claude models via the :thinking suffix. When you add :thinking to a model ID, SolRouter enables Claude's extended reasoning mode, which causes the model to emit thinking content blocks before the final text block.

| Model ID | Context | Input ($/M) | Output ($/M) | Notes |
| --- | --- | --- | --- | --- |
| anthropic/claude-3.7-sonnet:thinking | 200k | $3.00 | $15.00 | Extended thinking enabled |
| anthropic/claude-opus-4:thinking | 200k | $15.00 | $75.00 | Most capable, full thinking |

DeepSeek R-series

DeepSeek's R1 model is fully open-weights and one of the most capable reasoning models available. It exposes its chain-of-thought in a <think>...</think> block within the message content, which you can parse or strip depending on your use case.

| Model ID | Context | Input ($/M) | Output ($/M) | Notes |
| --- | --- | --- | --- | --- |
| deepseek/deepseek-r1 | 164k | $0.55 | $2.19 | Full reasoning, open weights |
| deepseek/deepseek-r1-distill-llama-70b | 131k | $0.23 | $0.69 | Distilled, faster |
| deepseek/deepseek-r1-distill-qwen-32b | 131k | $0.12 | $0.18 | Smallest distillation |
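If you only need the final answer from R1, the reasoning block can be removed with a small helper. A minimal sketch, assuming the thinking always arrives as a single well-formed, non-nested tag pair:

```typescript
// Minimal sketch: strip DeepSeek R1's <think>...</think> reasoning block
// from message content, keeping only the final answer. Assumes well-formed,
// non-nested tags.
function stripThinking(content: string): string {
  return content.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}

// stripThinking("<think>steps...</think>The answer is 42.") → "The answer is 42."
```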

Google Gemini 2.5

Gemini 2.5 Pro and Flash incorporate chain-of-thought reasoning natively. Google exposes thinking token counts separately in the usage object so you can track the cost of the internal reasoning phase.

| Model ID | Context | Input ($/M) | Output ($/M) | Notes |
| --- | --- | --- | --- | --- |
| google/gemini-2.5-pro | 1M | $1.25 | $10.00 | Deep reasoning, 1M context |
| google/gemini-2.5-flash | 1M | $0.15 | $0.60 | Fast reasoning, budget option |
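Because thinking token counts are reported separately, you can attribute reasoning cost per request. A rough sketch, under the assumption that thinking tokens are billed at the output rate and reported alongside (not inside) completion_tokens; check your actual usage objects before relying on this:

```typescript
// Illustrative cost estimate for a Gemini 2.5 request. Assumes
// usage.thinking_tokens is reported separately from completion_tokens and
// is billed at the output rate.
interface GeminiUsage {
  prompt_tokens: number;
  completion_tokens: number;
  thinking_tokens: number;
}

function estimateCost(usage: GeminiUsage, inputPerM: number, outputPerM: number): number {
  const input = (usage.prompt_tokens / 1_000_000) * inputPerM;
  const output = ((usage.completion_tokens + usage.thinking_tokens) / 1_000_000) * outputPerM;
  return input + output;
}

// gemini-2.5-pro rates from the table above: $1.25/M input, $10.00/M output
```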

xAI Grok 3 Mini

x-ai/grok-3-mini exposes reasoning traces and is optimised for logical tasks at a lower cost than the flagship Grok 3 model.

| Model ID | Context | Input ($/M) | Output ($/M) | Notes |
| --- | --- | --- | --- | --- |
| x-ai/grok-3-mini | 131k | $0.30 | $0.50 | Reasoning traces exposed |

When to use reasoning models

Use reasoning models when

  • The problem requires multiple logical steps where an error in one step cascades to wrong results (maths, proofs, algorithms)
  • You need reliable code generation for complex functions — reasoning models self-verify before emitting code
  • The task involves structured planning: project breakdowns, dependency graphs, multi-constraint scheduling
  • You need scientific or technical analysis that requires domain knowledge combined with careful deduction
  • Accuracy matters more than latency — you can tolerate a slower response in exchange for a much higher success rate

Use standard models when

  • The task is straightforward: summarisation, translation, simple Q&A, creative writing
  • Latency is critical — reasoning models think before they speak, adding seconds to the response time
  • You are running high-volume, low-complexity tasks where the cost premium of reasoning tokens is not justified
  • You need streaming with low time-to-first-token — reasoning models have a longer warmup before the first output token appears

Quick decision guide

| Scenario | Recommended model type |
| --- | --- |
| "Write a haiku about autumn" | Standard model |
| "Solve this differential equation" | Reasoning model |
| "Summarise this article" | Standard model |
| "Debug why this recursive function stack-overflows" | Reasoning model |
| "Translate this paragraph to French" | Standard model |
| "Design a rate-limiter that handles burst traffic" | Reasoning model |
| "Write a product description" | Standard model |
| "Prove that √2 is irrational" | Reasoning model |
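The guide above collapses into a trivial dispatch helper. This is purely illustrative: the task categories and the specific model picks are assumptions for the example, not a SolRouter feature.

```typescript
// Illustrative only: route a request to a reasoning or standard model based
// on a coarse task category. Categories and model choices are assumptions.
type Task = "summarise" | "translate" | "creative" | "math" | "debug" | "design";

const NEEDS_REASONING: ReadonlySet<Task> = new Set(["math", "debug", "design"]);

function pickModel(task: Task): string {
  return NEEDS_REASONING.has(task) ? "openai/o4-mini" : "anthropic/claude-sonnet-4";
}
```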

The :thinking suffix

For Anthropic models, appending :thinking to the model ID activates Claude's extended reasoning mode. This is a SolRouter convention that maps to Anthropic's thinking parameter internally — you do not need to set any extra request fields.

// Standard Claude — fast, no extended thinking
const standard = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4",
  messages: [{ role: "user", content: "What is 17 * 23?" }],
});

// Extended thinking Claude — slower, much more accurate on hard problems
const thinking = await client.chat.completions.create({
  model: "anthropic/claude-3.7-sonnet:thinking",
  messages: [{ role: "user", content: "Prove that there are infinitely many prime numbers." }],
});

When using a :thinking model the response may include one or more thinking content blocks before the final text block. SolRouter passes these through unchanged:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": [
          {
            "type": "thinking",
            "thinking": "I need to prove there are infinitely many primes. The classic approach is Euclid's proof by contradiction..."
          },
          {
            "type": "text",
            "text": "**Proof (Euclid's theorem):** Assume for contradiction that there are finitely many primes..."
          }
        ]
      }
    }
  ]
}

If your application only needs the final answer, filter content blocks by type === "text".
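For example, a small helper along these lines (the block shape matches the response above; standard models still return a plain string):

```typescript
// Sketch: pull only the visible answer out of a :thinking response.
// Content is either a plain string (standard models) or an array of typed
// blocks as shown above.
interface ContentBlock {
  type: "thinking" | "text";
  thinking?: string;
  text?: string;
}

function finalText(content: string | ContentBlock[]): string {
  if (typeof content === "string") return content;
  return content
    .filter((block) => block.type === "text")
    .map((block) => block.text ?? "")
    .join("");
}
```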


Cost and token considerations

Reasoning tokens cost more

Reasoning models consume additional tokens during their internal thinking phase. These reasoning tokens are charged at the output token rate even though they do not appear in the final response content.

| Token type | Visible in response | Charged |
| --- | --- | --- |
| Prompt tokens | n/a | Yes |
| Reasoning / thinking tokens | Depends on model | Yes |
| Completion tokens | Yes | Yes |

For OpenAI o-series models, reasoning token usage appears in usage.completion_tokens_details.reasoning_tokens. For Google Gemini 2.5, thinking tokens appear in usage.thinking_tokens. DeepSeek R1 includes its thinking text inside <think> tags, so those tokens appear as normal completion tokens.

Example usage object (OpenAI o3)

{
  "usage": {
    "prompt_tokens": 245,
    "completion_tokens": 1820,
    "total_tokens": 2065,
    "completion_tokens_details": {
      "reasoning_tokens": 1536,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    },
    "cost": 0.07525
  }
}

In this example, 1,536 of the 1,820 completion tokens were consumed by internal reasoning; only 284 tokens appear in the visible response.
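Billing therefore follows from the prompt and completion totals alone, since reasoning tokens are already counted inside completion_tokens. A quick sanity-check helper using the table rates for openai/o3 ($10.00/M input, $40.00/M output):

```typescript
// Compute request cost from an OpenAI-compatible usage object. Reasoning
// tokens need no special handling: they are already included in
// completion_tokens and billed at the output rate.
function requestCost(
  usage: { prompt_tokens: number; completion_tokens: number },
  inputPerM: number,
  outputPerM: number,
): number {
  return (usage.prompt_tokens / 1e6) * inputPerM + (usage.completion_tokens / 1e6) * outputPerM;
}

// requestCost({ prompt_tokens: 245, completion_tokens: 1820 }, 10.0, 40.0) ≈ $0.07525
```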

max_completion_tokens vs max_tokens

OpenAI o-series models require max_completion_tokens instead of max_tokens. This parameter caps the total completion budget including reasoning tokens.

const completion = await client.chat.completions.create({
  model: "openai/o4-mini",
  messages: [{ role: "user", content: "Solve: find all integer solutions to x² + y² = 65" }],
  max_completion_tokens: 8000,  // caps reasoning + output combined
});

For Anthropic :thinking models, use the standard max_tokens parameter — it applies to the combined thinking + response token budget.

const completion = await client.chat.completions.create({
  model: "anthropic/claude-3.7-sonnet:thinking",
  messages: [{ role: "user", content: "Implement a red-black tree in TypeScript." }],
  max_tokens: 16000,  // thinking + response combined
});

Tip: Set max_completion_tokens / max_tokens generously for reasoning models. If the budget is too tight, the model may cut off its thinking mid-way and produce a lower-quality or truncated response.
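One way to guard against this is to retry with a larger budget when the response was cut off. A sketch, under the assumption that an exhausted budget surfaces as the OpenAI-compatible finish_reason value "length":

```typescript
// Sketch: decide whether to retry with a bigger completion budget after a
// possibly truncated reasoning response. Assumes finish_reason === "length"
// signals an exhausted token budget.
function nextBudget(finishReason: string, current: number, max = 32000): number | null {
  if (finishReason !== "length" || current >= max) return null; // no retry needed or possible
  return Math.min(current * 2, max); // double the budget for the retry
}
```

Call the model, and while nextBudget returns a number, resend the same request with the larger max_completion_tokens (or max_tokens for Anthropic :thinking models).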


Streaming with reasoning models

All reasoning models available through SolRouter support streaming via Server-Sent Events (SSE). However, there are a few behavioural differences to be aware of.

Time to first token is longer

Reasoning models have a warmup period before the first token appears. The model must begin its internal thinking phase before streaming starts. For openai/o3 this can be several seconds on hard problems. Plan your UI accordingly — show a "thinking…" indicator rather than a blank screen.

Streaming a reasoning model (TypeScript)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.solrouter.io/ai",
  apiKey: process.env.SOLROUTER_API_KEY,
});

const stream = await client.chat.completions.create({
  model: "openai/o4-mini",
  messages: [
    {
      role: "user",
      content: "Write a recursive descent parser for arithmetic expressions in TypeScript.",
    },
  ],
  max_completion_tokens: 12000,
  stream: true,
});

process.stdout.write("Thinking");
let dotInterval = setInterval(() => process.stdout.write("."), 500);

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) {
    clearInterval(dotInterval);
    process.stdout.write(delta);
  }
}

Streaming a :thinking model (TypeScript)

When streaming a Claude :thinking model, thinking content blocks are emitted as content_block_delta events with type: "thinking_delta". SolRouter translates these into the standard SSE delta format. You can detect thinking chunks by checking the delta type:

const stream = await client.chat.completions.create({
  model: "anthropic/claude-3.7-sonnet:thinking",
  messages: [{ role: "user", content: "Derive the quadratic formula from first principles." }],
  max_tokens: 10000,
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta;
  if (!delta) continue;

  // Thinking delta — internal reasoning, shown here for transparency
  if (delta.type === "thinking") {
    process.stdout.write(`[thinking] ${delta.thinking ?? ""}`);
  }

  // Text delta — the final visible response
  if (delta.content) {
    process.stdout.write(delta.content);
  }
}

Streaming DeepSeek R1 (Python)

DeepSeek R1 embeds its reasoning inside <think>...</think> tags in the regular content stream. You can parse these out in real time:

from openai import OpenAI
import os
import re

client = OpenAI(
    base_url="https://api.solrouter.io/ai",
    api_key=os.environ["SOLROUTER_API_KEY"],
)

stream = client.chat.completions.create(
    model="deepseek/deepseek-r1",
    messages=[{"role": "user", "content": "What is the time complexity of merge sort? Prove it."}],
    stream=True,
)

buffer = ""
in_thinking = False

TAGS = ("<think>", "</think>")

def partial_tag_len(s: str) -> int:
    """Length of a trailing fragment of s that could be the start of a tag."""
    for i in range(min(len(s), 7), 0, -1):  # longest partial prefix is 7 chars ("</think")
        if any(tag.startswith(s[-i:]) for tag in TAGS):
            return i
    return 0

for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    buffer += delta

    # Detect start of thinking block
    if "<think>" in buffer and not in_thinking:
        in_thinking = True
        print("[thinking] ", end="", flush=True)
        buffer = buffer.split("<think>", 1)[1]

    # Detect end of thinking block
    if "</think>" in buffer and in_thinking:
        in_thinking = False
        thinking_rest, buffer = buffer.split("</think>", 1)
        print(thinking_rest, end="", flush=True)  # remainder of thinking
        print("\n[response] ", end="", flush=True)

    # Flush the buffer, holding back any trailing characters that might be
    # a tag split across chunks (e.g. a chunk ending in "</thi")
    held = partial_tag_len(buffer)
    print(buffer[: len(buffer) - held], end="", flush=True)
    buffer = buffer[len(buffer) - held :]

Practical examples

Complex code generation (TypeScript)

const completion = await client.chat.completions.create({
  model: "openai/o4-mini",
  messages: [
    {
      role: "system",
      content: "You are an expert TypeScript engineer. Produce clean, well-typed, production-ready code.",
    },
    {
      role: "user",
      content:
        "Implement a generic LRU cache in TypeScript with O(1) get and put operations. " +
        "Include full type parameters, JSDoc comments, and unit tests using Vitest.",
    },
  ],
  max_completion_tokens: 6000,
});

console.log(completion.choices[0].message.content);

Multi-step mathematical reasoning (Python)

completion = client.chat.completions.create(
    model="deepseek/deepseek-r1",
    messages=[
        {
            "role": "user",
            "content": (
                "A train leaves city A at 9:00 AM travelling at 80 km/h toward city B. "
                "Another train leaves city B at 10:30 AM travelling at 120 km/h toward city A. "
                "The distance between city A and city B is 600 km. "
                "At what time do the trains meet, and how far is the meeting point from city A?"
            ),
        }
    ],
)

print(completion.choices[0].message.content)

Research and analysis (Python)

completion = client.chat.completions.create(
    model="google/gemini-2.5-pro",
    messages=[
        {
            "role": "system",
            "content": "You are a senior software architect. Provide rigorous, evidence-based analysis.",
        },
        {
            "role": "user",
            "content": (
                "Compare event-driven architecture and request-response architecture "
                "for a real-time collaborative document editor with 10,000 concurrent users. "
                "Consider consistency, latency, fault tolerance, and operational complexity."
            ),
        },
    ],
    max_tokens=8000,
)

print(completion.choices[0].message.content)

Common pitfalls

Using temperature with o-series models
OpenAI's o-series models do not accept a temperature parameter, and passing one results in an error. Remove temperature from your request when targeting any o-series model (openai/o3, openai/o4-mini, openai/o3-mini).

Setting max_completion_tokens too low
If the reasoning budget is exhausted before the model finishes thinking, it produces a truncated or lower-quality answer. Start with at least 4000 tokens and increase for complex problems.

Expecting low latency
Reasoning models trade speed for accuracy. If you have a latency-sensitive endpoint (e.g. autocomplete or live chat), use a standard model. Reserve reasoning models for asynchronous or background tasks where waiting a few extra seconds is acceptable.

Forgetting to strip thinking blocks from user-facing output
If you display model output directly to end users, filter out thinking content blocks (Claude) or <think>...</think> sections (DeepSeek) before rendering. These are internal reasoning aids, not polished prose.


Next steps

  • Available Models — full model catalogue with context lengths, pricing, and modality filters
  • Token Counting — understand reasoning token costs and estimate request budgets
  • Model Fallback — automatically fall back to a standard model if a reasoning model is unavailable
  • Streaming — full SSE streaming guide including React patterns and edge runtimes