Token Counting


Every SolRouter request is billed according to the number of tokens processed by the selected model. Understanding how token counting works helps you estimate cost, avoid context window errors, and choose the right model for each workload.

This page explains what tokens are, how they are counted, how image and multimodal inputs affect usage, and how to estimate cost before you send a request.


What is a token?

A token is a chunk of text used internally by a language model. Tokens are not the same as words or characters:

  • A short word may be one token
  • A long word may be multiple tokens
  • Spaces, punctuation, symbols, and newlines also count
  • JSON, code, and structured text often produce more tokens than plain prose

As a rough rule of thumb for English text:

Text size             Approximate token count
1 short sentence      10–25 tokens
1 paragraph           80–200 tokens
1 page of text        500–1,000 tokens
1,000 English words   ~1,300 tokens

These are only estimates. The exact token count depends on the model family and tokenizer.
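The word-based rule of thumb above can be turned into a quick sanity-check helper. This is a sketch only; real tokenizers will produce different counts depending on the model family:

```python
def rough_token_estimate(text: str) -> int:
    """Ballpark estimate: ~1.3 tokens per English word.

    Mirrors the rule of thumb above; actual counts depend on the
    model's tokenizer, so treat this as an approximation only.
    """
    words = len(text.split())
    return round(words * 1.3)

# 1,000 words comes out at roughly 1,300 tokens, matching the table above
print(rough_token_estimate("word " * 1000))  # → 1300
```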


How SolRouter reports usage

After a request completes, the response includes a usage object. This contains the token counts that were actually billed.

Example response fragment:

{
  "usage": {
    "prompt_tokens": 312,
    "completion_tokens": 87,
    "total_tokens": 399,
    "cost": 0.0000148
  }
}

Usage fields

Field               Meaning
prompt_tokens       Tokens in your input: system prompt, messages, tools, images, and other request metadata
completion_tokens   Tokens generated by the model in its response
total_tokens        prompt_tokens + completion_tokens
cost                Final billed cost in USD for that request

cost is the most important field for billing. It reflects the actual amount deducted from your balance for the completed request.
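When recording usage for your own accounting, it is worth sanity-checking the fields before logging them. A minimal sketch, assuming the response has already been parsed into a dictionary:

```python
# Usage object as returned in the example response fragment above
usage = {
    "prompt_tokens": 312,
    "completion_tokens": 87,
    "total_tokens": 399,
    "cost": 0.0000148,
}

# total_tokens should always equal the sum of the two component counts
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]

print(f"billed: ${usage['cost']:.7f} for {usage['total_tokens']} tokens")
```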


Cost formula

The cost of a request is determined by the model’s input and output pricing.

Basic formula

cost =
  (prompt_tokens × input_price_per_token) +
  (completion_tokens × output_price_per_token)

Because pricing is usually shown per million tokens, it is often easier to think of it this way:

cost =
  (prompt_tokens / 1,000,000 × input_price_per_million) +
  (completion_tokens / 1,000,000 × output_price_per_million)

Example

Suppose you send a request to a model priced at:

  • $3.00 / million input tokens
  • $15.00 / million output tokens

And the request uses:

  • prompt_tokens = 2,000
  • completion_tokens = 500

Then:

input cost  = 2,000 / 1,000,000 × 3.00   = 0.006
output cost =   500 / 1,000,000 × 15.00  = 0.0075
total cost  = 0.0135

So the request costs:

$0.0135

What counts toward prompt_tokens

Many developers think only the visible message text is counted. In practice, prompt_tokens often includes more than that.

The following typically contribute to prompt usage:

  • System prompts
  • User messages
  • Assistant messages from previous turns
  • Tool definitions
  • Function schemas
  • JSON schemas used for structured output
  • Image or multimodal input representations
  • Internal formatting required by the model provider

This means two requests with the same visible prompt may still have different token counts if one includes:

  • Long conversation history
  • Large tool definitions
  • Large JSON schemas
  • Attached images or files

Example: chat request and token usage

Request

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.solrouter.io/ai",
  apiKey: process.env.SOLROUTER_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4",
  messages: [
    {
      role: "system",
      content: "You are a concise technical assistant.",
    },
    {
      role: "user",
      content: "Explain what a context window is in LLMs.",
    },
  ],
});

console.log(completion.usage);

Possible usage output

{
  "prompt_tokens": 42,
  "completion_tokens": 121,
  "total_tokens": 163,
  "cost": 0.000573
}

Even in this simple request, the system message and request formatting are included in prompt_tokens.


Conversation history and token growth

In multi-turn conversations, each new request typically includes some or all prior messages. This means prompt usage grows over time.

Example

Turn 1:

[
  { "role": "user", "content": "Hello, who are you?" }
]

Turn 2:

[
  { "role": "user", "content": "Hello, who are you?" },
  { "role": "assistant", "content": "I'm an AI assistant." },
  { "role": "user", "content": "Can you explain token counting?" }
]

By turn 2, the request includes:

  • The original user message
  • The previous assistant reply
  • The new user message

So prompt_tokens is higher than it was on turn 1.

Why this matters

Long-running chats can become expensive and may eventually exceed the model’s context window. To control costs and latency:

  • Trim old messages
  • Summarise older context
  • Use smaller models for routine turns
  • Limit verbose system prompts
  • Remove unnecessary tool definitions when not needed
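The first two tactics above can be combined into a simple trimming step: keep the system prompt and drop the oldest turns until the estimated prompt size fits a budget. A minimal sketch using the rough ~4 characters per token heuristic (the budget value and heuristic are illustrative, not exact):

```python
def trim_history(messages: list[dict], max_prompt_tokens: int = 3000) -> list[dict]:
    """Keep system messages, drop the oldest non-system turns until the
    estimated prompt size fits the budget (~4 chars per token heuristic)."""
    def est(msgs):
        return sum(len(m["content"]) for m in msgs) // 4

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and est(system + rest) > max_prompt_tokens:
        rest.pop(0)  # drop the oldest turn first
    return system + rest

# Five large user turns of ~1,000 estimated tokens each
history = [{"role": "system", "content": "Be concise."}] + [
    {"role": "user", "content": "x" * 4000} for _ in range(5)
]
trimmed = trim_history(history, max_prompt_tokens=3000)
print(len(trimmed))  # system message plus only the most recent turns that fit
```

In production you would summarise the dropped turns rather than discard them outright, and use a real tokenizer for the estimate.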

Context window limits

Every model has a maximum context window. This is the total number of tokens the model can consider in one request.

That limit includes:

  • Prompt tokens
  • Completion tokens you ask the model to generate

If your prompt is too large, or if your prompt plus requested output exceeds the model’s limit, the request may fail.

Example

If a model supports a 128k context window:

  • Your prompt and the requested output together may use at most roughly 128,000 tokens
  • So if you want the model to generate 4,000 tokens, your prompt must leave room for those 4,000 output tokens

Typical failure case

You send:

  • prompt_tokens ≈ 127,500
  • max_tokens = 4,000

This exceeds the available context budget, so the request may be rejected.

Best practices

  • Leave output headroom when setting max_tokens
  • Trim older messages before retrying
  • Prefer long-context models for documents, transcripts, and large codebases
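The headroom check described above can be run as a preflight step before sending the request. A sketch (the 128,000-token window is illustrative; substitute your model's actual limit):

```python
def fits_context(prompt_tokens: int, max_tokens: int, context_window: int) -> bool:
    """Return True if the prompt plus the requested output fits the window."""
    return prompt_tokens + max_tokens <= context_window

# The failure case described above: 127,500 + 4,000 exceeds 128,000
print(fits_context(127_500, 4_000, 128_000))  # → False
print(fits_context(120_000, 4_000, 128_000))  # → True
```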

Estimating tokens before sending a request

For applications that need budgeting, quota checks, or preflight validation, estimate tokens locally before making the API call.

Keep in mind:

  • Local estimates are useful for budgeting and preflight checks, but they are approximations
  • Final billing is based on the provider’s actual tokenization and usage accounting
  • Different model families may tokenize the same text differently

JavaScript with js-tiktoken

import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4o");

const text = "Explain the difference between tokens and words.";
const tokens = enc.encode(text);

console.log(tokens.length);

Estimating a full chat payload

import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4o");

const messages = [
  { role: "system", content: "You are a concise assistant." },
  { role: "user", content: "Summarise this text in 3 bullet points." },
];

const approximatePromptTokens = messages.reduce((sum, msg) => {
  return sum + enc.encode(msg.content).length;
}, 0);

console.log({ approximatePromptTokens });

This estimate will not perfectly match the billed token count, because chat formatting and provider-specific serialization are not fully represented.

Python with tiktoken

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

text = "Explain the difference between tokens and words."
tokens = enc.encode(text)

print(len(tokens))

Estimating message history in Python

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarise this text in 3 bullet points."},
]

approx_prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
print(approx_prompt_tokens)

Image tokens and multimodal inputs

For multimodal models, images are not free. Image inputs contribute to prompt_tokens, but the exact token cost depends on:

  • The selected model
  • Image dimensions
  • How the provider internally resizes or tiles the image
  • Whether the model supports low-detail vs high-detail processing

Important points

  • A small image usually costs fewer tokens than a large one
  • Multiple images increase prompt usage
  • High-resolution images may significantly increase cost
  • Different providers price image processing differently
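As one concrete illustration of how dimensions drive cost, OpenAI has published a tile-based heuristic for high-detail image input on GPT-4-class models: 85 base tokens plus 170 tokens per 512-px tile after scaling. Other providers use different schemes, so treat this as a sketch of one provider's approach, not a universal formula:

```python
import math

def openai_high_detail_image_tokens(width: int, height: int) -> int:
    """OpenAI's published high-detail heuristic: fit the image into a
    2048x2048 box, scale the shortest side down to 768 if larger, then
    charge 85 base tokens + 170 per 512-px tile."""
    # Fit within 2048 x 2048 (only ever scale down)
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale the shortest side down to 768 (only ever scale down)
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(openai_high_detail_image_tokens(1024, 1024))  # → 765 (4 tiles)
print(openai_high_detail_image_tokens(512, 512))    # → 255 (1 tile)
```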

Example request with image input

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe this chart." },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/chart.png",
          },
        },
      ],
    },
  ],
});

console.log(completion.usage);

The returned prompt_tokens includes both:

  • The text portion
  • The image processing cost

Practical advice for image-heavy workloads

  • Resize oversized images before upload
  • Avoid sending multiple near-identical images
  • Use cheaper multimodal models for simple OCR or captioning tasks
  • Inspect usage.cost after a few sample requests before scaling up

Free models and token counting

Free models still produce token counts in usage, even when the request cost is zero.

Example:

{
  "usage": {
    "prompt_tokens": 441,
    "completion_tokens": 96,
    "total_tokens": 537,
    "cost": 0
  }
}

This is useful because you can still measure:

  • Prompt size
  • Response length
  • Relative efficiency
  • Whether a workflow will fit within context limits

The only difference is that no paid credit is deducted for that request.


Tool calling and structured output increase prompt size

Features like tool calling and structured output are powerful, but they also add tokens.

Tool calling adds:

  • Tool names
  • Descriptions
  • JSON parameter schemas

Structured output adds:

  • JSON schema definitions
  • Validation instructions
  • Additional formatting constraints

Example with a tool definition:

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o-mini",
  messages: [
    { role: "user", content: "What's the weather in Berlin?" },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Fetches weather for a city",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string" },
          },
          required: ["city"],
        },
      },
    },
  ],
});

That schema contributes to prompt_tokens, even though the user never typed it.

If you define many tools or very large schemas, usage can grow quickly.
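A quick way to see how much a tool definition adds is to serialize it and apply the rough ~4 characters per token heuristic (a sketch only; use a real tokenizer for accuracy, and note that providers may add their own wrapping around the schema):

```python
import json

# The same tool definition used in the example above
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Fetches weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Providers serialize tool schemas into the prompt, so the JSON size is a
# reasonable proxy for the token contribution (~4 chars per token)
approx_tool_tokens = len(json.dumps(tools)) // 4
print(approx_tool_tokens)
```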


Prompt caching fields

Some models expose prompt caching-related pricing fields in the model metadata:

  • PricingCacheRead
  • PricingCacheWrite

These fields indicate that the provider may support reduced pricing for cached prompt segments.

What these mean

Field               Meaning
PricingCacheWrite   Cost for writing reusable prompt content into cache
PricingCacheRead    Reduced cost when the model reuses cached prompt content

Not all models support prompt caching, and the request format depends on the underlying provider’s capabilities.

When available, prompt caching can reduce cost for workloads that repeatedly reuse large prefixes such as:

  • Long system prompts
  • Large policy documents
  • Repeated codebase context
  • Reusable RAG context blocks

If a model does not expose cache pricing fields, assume standard prompt billing applies.
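When a model does expose cache pricing, the basic cost formula extends naturally: cached prompt tokens are billed at the cache-read rate and the remainder at the normal input rate. A hedged sketch; the parameter names and the split between cached and uncached tokens are illustrative, since the actual usage fields vary by provider:

```python
def estimate_cost_with_cache(
    uncached_prompt_tokens: int,
    cached_prompt_tokens: int,
    completion_tokens: int,
    input_per_million: float,
    cache_read_per_million: float,
    output_per_million: float,
) -> float:
    """Extends the basic cost formula with a reduced rate for cached reads."""
    return (
        uncached_prompt_tokens / 1_000_000 * input_per_million
        + cached_prompt_tokens / 1_000_000 * cache_read_per_million
        + completion_tokens / 1_000_000 * output_per_million
    )

# Hypothetical pricing: $3.00/M input, $0.30/M cache read, $15.00/M output,
# with a 10,000-token cached prefix and 2,000 uncached prompt tokens
print(estimate_cost_with_cache(2_000, 10_000, 500, 3.0, 0.30, 15.0))
```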


Building a local cost estimator

A practical approach is to estimate tokens locally, then apply the model’s published pricing.

TypeScript example

type Pricing = {
  inputPerMillion: number;
  outputPerMillion: number;
};

function estimateCost(
  promptTokens: number,
  completionTokens: number,
  pricing: Pricing,
): number {
  const inputCost =
    (promptTokens / 1_000_000) * pricing.inputPerMillion;

  const outputCost =
    (completionTokens / 1_000_000) * pricing.outputPerMillion;

  return inputCost + outputCost;
}

const estimated = estimateCost(2500, 800, {
  inputPerMillion: 3.0,
  outputPerMillion: 15.0,
});

console.log(estimated);

Python example

def estimate_cost(prompt_tokens: int, completion_tokens: int, input_per_million: float, output_per_million: float) -> float:
    input_cost = (prompt_tokens / 1_000_000) * input_per_million
    output_cost = (completion_tokens / 1_000_000) * output_per_million
    return input_cost + output_cost

estimated = estimate_cost(2500, 800, 3.0, 15.0)
print(estimated)

This is useful for:

  • Pre-request budgeting
  • Internal quotas
  • Cost previews in your UI
  • Guardrails before expensive long-context jobs

Common mistakes

1. Counting only the latest user message

Wrong assumption:

  • “My prompt is only 50 tokens”

Reality:

  • The request may also include system prompts, message history, tools, and schemas

2. Ignoring output headroom

Wrong assumption:

  • “The prompt fits in the model’s context window”

Reality:

  • You also need room for the response

3. Underestimating image cost

Wrong assumption:

  • “The image is just one attachment”

Reality:

  • Images may consume substantial prompt budget depending on size and model

4. Assuming all models tokenize identically

Wrong assumption:

  • “This estimate will be exact everywhere”

Reality:

  • Different providers and model families may produce different token counts

5. Ignoring conversation growth

Wrong assumption:

  • “Each turn costs about the same”

Reality:

  • Multi-turn chats often get more expensive unless you trim history

Best practices

  • Keep system prompts concise
  • Trim long message histories
  • Use the cheapest model that reliably solves the task
  • Estimate tokens locally for high-volume workloads
  • Check usage.cost after live requests and calibrate your estimates
  • Leave context headroom for output tokens
  • Resize images before sending them
  • Avoid oversized tool definitions and schemas unless necessary

Next steps