Token Counting


Every SolRouter request is billed according to the number of tokens processed by the selected model. Understanding how token counting works helps you estimate cost, avoid context window errors, and choose the right model for each workload.

This page explains what tokens are, how they are counted, how image and multimodal inputs affect usage, and how to estimate cost before you send a request.


What is a token?

A token is a chunk of text used internally by a language model. Tokens are not the same as words or characters:

  • A short word may be one token
  • A long word may be multiple tokens
  • Spaces, punctuation, symbols, and newlines also count
  • JSON, code, and structured text often produce more tokens than plain prose

As a rough rule of thumb for English text:

Text size             Approximate token count
1 short sentence      10–25 tokens
1 paragraph           80–200 tokens
1 page of text        500–1,000 tokens
1,000 English words   ~1,300 tokens

These are only estimates. The exact token count depends on the model family and tokenizer.
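The word-based rule of thumb above can be turned into a quick sanity-check helper. This is a sketch only; real tokenizers will produce different counts depending on the model family:

```python
def rough_token_estimate(text: str) -> int:
    """Ballpark estimate: ~1.3 tokens per English word.

    Mirrors the rule of thumb above; actual counts depend on the
    model's tokenizer, so treat this as an approximation only.
    """
    words = len(text.split())
    return round(words * 1.3)

# 1,000 words comes out at roughly 1,300 tokens, matching the table above
print(rough_token_estimate("word " * 1000))  # → 1300
```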


How SolRouter reports usage

After a request completes, the response includes a usage object. This contains the token counts that were actually billed.

Example response fragment:

{
  "usage": {
    "prompt_tokens": 312,
    "completion_tokens": 87,
    "total_tokens": 399,
    "cost": 0.0000148
  }
}

Usage fields

Field               Meaning
prompt_tokens       Tokens in your input: system prompt, messages, tools, images, and other request metadata
completion_tokens   Tokens generated by the model in its response
total_tokens        prompt_tokens + completion_tokens
cost                Final billed cost in USD for that request

cost is the most important field for billing. It reflects the actual amount deducted from your balance for the completed request.
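When recording usage for your own accounting, it is worth sanity-checking the fields before logging them. A minimal sketch, assuming the response has already been parsed into a dictionary:

```python
# Usage object as returned in the example response fragment above
usage = {
    "prompt_tokens": 312,
    "completion_tokens": 87,
    "total_tokens": 399,
    "cost": 0.0000148,
}

# total_tokens should always equal the sum of the two component counts
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]

print(f"billed: ${usage['cost']:.7f} for {usage['total_tokens']} tokens")
```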


Cost formula

The cost of a request is determined by the model’s input and output pricing.

Basic formula

cost =
  (prompt_tokens × input_price_per_token) +
  (completion_tokens × output_price_per_token)

Because pricing is usually shown per million tokens, it is often easier to think of it this way:

cost =
  (prompt_tokens / 1,000,000 × input_price_per_million) +
  (completion_tokens / 1,000,000 × output_price_per_million)

Example

Suppose you send a request to a model priced at:

  • $3.00 / million input tokens
  • $15.00 / million output tokens

And the request uses:

  • prompt_tokens = 2,000
  • completion_tokens = 500

Then:

input cost  = 2,000 / 1,000,000 × 3.00   = 0.006
output cost =   500 / 1,000,000 × 15.00  = 0.0075
total cost  = 0.0135

So the request costs:

$0.0135

What counts toward prompt_tokens

Many developers think only the visible message text is counted. In practice, prompt_tokens often includes more than that.

The following typically contribute to prompt usage:

  • System prompts
  • User messages
  • Assistant messages from previous turns
  • Tool definitions
  • Function schemas
  • JSON schemas used for structured output
  • Image or multimodal input representations
  • Internal formatting required by the model provider

This means two requests with the same visible prompt may still have different token counts if one includes:

  • Long conversation history
  • Large tool definitions
  • Large JSON schemas
  • Attached images or files

Example: chat request and token usage

Request

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.solrouter.io/ai",
  apiKey: process.env.SOLROUTER_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4",
  messages: [
    {
      role: "system",
      content: "You are a concise technical assistant.",
    },
    {
      role: "user",
      content: "Explain what a context window is in LLMs.",
    },
  ],
});

console.log(completion.usage);

Possible usage output

{
  "prompt_tokens": 42,
  "completion_tokens": 121,
  "total_tokens": 163,
  "cost": 0.000573
}

Even in this simple request, the system message and request formatting are included in prompt_tokens.


Conversation history and token growth

In multi-turn conversations, each new request typically includes some or all prior messages. This means prompt usage grows over time.

Example

Turn 1:

[
  { "role": "user", "content": "Hello, who are you?" }
]

Turn 2:

[
  { "role": "user", "content": "Hello, who are you?" },
  { "role": "assistant", "content": "I'm an AI assistant." },
  { "role": "user", "content": "Can you explain token counting?" }
]

By turn 2, the request includes:

  • The original user message
  • The previous assistant reply
  • The new user message

So prompt_tokens is higher than it was on turn 1.

Why this matters

Long-running chats can become expensive and may eventually exceed the model’s context window. To control costs and latency:

  • Trim old messages
  • Summarise older context
  • Use smaller models for routine turns
  • Limit verbose system prompts
  • Remove unnecessary tool definitions when not needed
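The first two tactics above can be combined into a simple trimming step: keep the system prompt and drop the oldest turns until the estimated prompt size fits a budget. A minimal sketch using the rough ~4 characters per token heuristic (the budget value and heuristic are illustrative, not exact):

```python
def trim_history(messages: list[dict], max_prompt_tokens: int = 3000) -> list[dict]:
    """Keep system messages, drop the oldest non-system turns until the
    estimated prompt size fits the budget (~4 chars per token heuristic)."""
    def est(msgs):
        return sum(len(m["content"]) for m in msgs) // 4

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and est(system + rest) > max_prompt_tokens:
        rest.pop(0)  # drop the oldest turn first
    return system + rest

# Five large user turns of ~1,000 estimated tokens each
history = [{"role": "system", "content": "Be concise."}] + [
    {"role": "user", "content": "x" * 4000} for _ in range(5)
]
trimmed = trim_history(history, max_prompt_tokens=3000)
print(len(trimmed))  # system message plus only the most recent turns that fit
```

In production you would summarise the dropped turns rather than discard them outright, and use a real tokenizer for the estimate.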

Context window limits

Every model has a maximum context window. This is the total number of tokens the model can consider in one request.

That limit includes:

  • Prompt tokens
  • Completion tokens you ask the model to generate

If your prompt is too large, or if your prompt plus requested output exceeds the model’s limit, the request may fail.

Example

If a model supports a 128k context window:

  • Your prompt and the requested output together may use at most roughly 128,000 tokens
  • So if you want the model to generate 4,000 tokens, your prompt must leave room for those 4,000 output tokens

Typical failure case

You send:

  • prompt_tokens ≈ 127,500
  • max_tokens = 4,000

This exceeds the available context budget, so the request may be rejected.

Best practices

  • Leave output headroom when setting max_tokens
  • Trim older messages before retrying
  • Prefer long-context models for documents, transcripts, and large codebases
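The headroom check described above can be run as a preflight step before sending the request. A sketch (the 128,000-token window is illustrative; substitute your model's actual limit):

```python
def fits_context(prompt_tokens: int, max_tokens: int, context_window: int) -> bool:
    """Return True if the prompt plus the requested output fits the window."""
    return prompt_tokens + max_tokens <= context_window

# The failure case described above: 127,500 + 4,000 exceeds 128,000
print(fits_context(127_500, 4_000, 128_000))  # → False
print(fits_context(120_000, 4_000, 128_000))  # → True
```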

Estimating tokens before sending a request

For applications that need budgeting, quota checks, or preflight validation, estimate tokens locally before making the API call.

Keep in mind:

  • Local estimates are useful for budgeting and preflight checks, but they are approximations
  • Final billing is based on the provider’s actual tokenization and usage accounting
  • Different model families may tokenize the same text differently

JavaScript with js-tiktoken

import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4o");

const text = "Explain the difference between tokens and words.";
const tokens = enc.encode(text);

console.log(tokens.length);

Estimating a full chat payload

import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4o");

const messages = [
  { role: "system", content: "You are a concise assistant." },
  { role: "user", content: "Summarise this text in 3 bullet points." },
];

const approximatePromptTokens = messages.reduce((sum, msg) => {
  return sum + enc.encode(msg.content).length;
}, 0);

console.log({ approximatePromptTokens });

This estimate will not perfectly match the billed token count, because chat formatting and provider-specific serialization are not fully represented.

Python with tiktoken

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

text = "Explain the difference between tokens and words."
tokens = enc.encode(text)

print(len(tokens))

Estimating message history in Python

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarise this text in 3 bullet points."},
]

approx_prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
print(approx_prompt_tokens)

Image tokens and multimodal inputs

For multimodal models, images are not free. Image inputs contribute to prompt_tokens, but the exact token cost depends on:

  • The selected model
  • Image dimensions
  • How the provider internally resizes or tiles the image
  • Whether the model supports low-detail vs high-detail processing

Important points

  • A small image usually costs fewer tokens than a large one
  • Multiple images increase prompt usage
  • High-resolution images may significantly increase cost
  • Different providers price image processing differently
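As one concrete illustration of how dimensions drive cost, OpenAI has published a tile-based heuristic for high-detail image input on GPT-4-class models: 85 base tokens plus 170 tokens per 512-px tile after scaling. Other providers use different schemes, so treat this as a sketch of one provider's approach, not a universal formula:

```python
import math

def openai_high_detail_image_tokens(width: int, height: int) -> int:
    """OpenAI's published high-detail heuristic: fit the image into a
    2048x2048 box, scale the shortest side down to 768 if larger, then
    charge 85 base tokens + 170 per 512-px tile."""
    # Fit within 2048 x 2048 (only ever scale down)
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale the shortest side down to 768 (only ever scale down)
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(openai_high_detail_image_tokens(1024, 1024))  # → 765 (4 tiles)
print(openai_high_detail_image_tokens(512, 512))    # → 255 (1 tile)
```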

Example request with image input

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe this chart." },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/chart.png",
          },
        },
      ],
    },
  ],
});

console.log(completion.usage);

The returned prompt_tokens includes both:

  • The text portion
  • The image processing cost

Practical advice for image-heavy workloads

  • Resize oversized images before upload
  • Avoid sending multiple near-identical images
  • Use cheaper multimodal models for simple OCR or captioning tasks
  • Inspect usage.cost after a few sample requests before scaling up

Free models and token counting

Free models still produce token counts in usage, even when the request cost is zero.

Example:

{
  "usage": {
    "prompt_tokens": 441,
    "completion_tokens": 96,
    "total_tokens": 537,
    "cost": 0
  }
}

This is useful because you can still measure:

  • Prompt size
  • Response length
  • Relative efficiency
  • Whether a workflow will fit within context limits

The only difference is that no paid credit is deducted for that request.


Tool calling and structured output increase prompt size

Features like tool calling and structured output are powerful, but they also add tokens.

Tool calling adds:

  • Tool names
  • Descriptions
  • JSON parameter schemas

Structured output adds:

  • JSON schema definitions
  • Validation instructions
  • Additional formatting constraints

Example with a tool definition:

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o-mini",
  messages: [
    { role: "user", content: "What's the weather in Berlin?" },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Fetches weather for a city",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string" },
          },
          required: ["city"],
        },
      },
    },
  ],
});

That schema contributes to prompt_tokens, even though the user never typed it.

If you define many tools or very large schemas, usage can grow quickly.
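A quick way to see how much a tool definition adds is to serialize it and apply the rough ~4 characters per token heuristic (a sketch only; use a real tokenizer for accuracy, and note that providers may add their own wrapping around the schema):

```python
import json

# The same tool definition used in the example above
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Fetches weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Providers serialize tool schemas into the prompt, so the JSON size is a
# reasonable proxy for the token contribution (~4 chars per token)
approx_tool_tokens = len(json.dumps(tools)) // 4
print(approx_tool_tokens)
```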


Prompt caching fields

Some models expose prompt caching-related pricing fields in the model metadata:

  • PricingCacheRead
  • PricingCacheWrite

These fields indicate that the provider may support reduced pricing for cached prompt segments.

What these mean

Field               Meaning
PricingCacheWrite   Cost for writing reusable prompt content into cache
PricingCacheRead    Reduced cost when the model reuses cached prompt content

Not all models support prompt caching, and the request format depends on the underlying provider’s capabilities.

When available, prompt caching can reduce cost for workloads that repeatedly reuse large prefixes such as:

  • Long system prompts
  • Large policy documents
  • Repeated codebase context
  • Reusable RAG context blocks

If a model does not expose cache pricing fields, assume standard prompt billing applies.
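When a model does expose cache pricing, the basic cost formula extends naturally: cached prompt tokens are billed at the cache-read rate and the remainder at the normal input rate. A hedged sketch; the parameter names and the split between cached and uncached tokens are illustrative, since the actual usage fields vary by provider:

```python
def estimate_cost_with_cache(
    uncached_prompt_tokens: int,
    cached_prompt_tokens: int,
    completion_tokens: int,
    input_per_million: float,
    cache_read_per_million: float,
    output_per_million: float,
) -> float:
    """Extends the basic cost formula with a reduced rate for cached reads."""
    return (
        uncached_prompt_tokens / 1_000_000 * input_per_million
        + cached_prompt_tokens / 1_000_000 * cache_read_per_million
        + completion_tokens / 1_000_000 * output_per_million
    )

# Hypothetical pricing: $3.00/M input, $0.30/M cache read, $15.00/M output,
# with a 10,000-token cached prefix and 2,000 uncached prompt tokens
print(estimate_cost_with_cache(2_000, 10_000, 500, 3.0, 0.30, 15.0))
```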


Building a local cost estimator

A practical approach is to estimate tokens locally, then apply the model’s published pricing.

TypeScript example

type Pricing = {
  inputPerMillion: number;
  outputPerMillion: number;
};

function estimateCost(
  promptTokens: number,
  completionTokens: number,
  pricing: Pricing,
): number {
  const inputCost =
    (promptTokens / 1_000_000) * pricing.inputPerMillion;

  const outputCost =
    (completionTokens / 1_000_000) * pricing.outputPerMillion;

  return inputCost + outputCost;
}

const estimated = estimateCost(2500, 800, {
  inputPerMillion: 3.0,
  outputPerMillion: 15.0,
});

console.log(estimated);

Python example

def estimate_cost(prompt_tokens: int, completion_tokens: int, input_per_million: float, output_per_million: float) -> float:
    input_cost = (prompt_tokens / 1_000_000) * input_per_million
    output_cost = (completion_tokens / 1_000_000) * output_per_million
    return input_cost + output_cost

estimated = estimate_cost(2500, 800, 3.0, 15.0)
print(estimated)

This is useful for:

  • Pre-request budgeting
  • Internal quotas
  • Cost previews in your UI
  • Guardrails before expensive long-context jobs

Common mistakes

1. Counting only the latest user message

Wrong assumption:

  • “My prompt is only 50 tokens”

Reality:

  • The request may also include system prompts, message history, tools, and schemas

2. Ignoring output headroom

Wrong assumption:

  • “The prompt fits in the model’s context window”

Reality:

  • You also need room for the response

3. Underestimating image cost

Wrong assumption:

  • “The image is just one attachment”

Reality:

  • Images may consume substantial prompt budget depending on size and model

4. Assuming all models tokenize identically

Wrong assumption:

  • “This estimate will be exact everywhere”

Reality:

  • Different providers and model families may produce different token counts

5. Ignoring conversation growth

Wrong assumption:

  • “Each turn costs about the same”

Reality:

  • Multi-turn chats often get more expensive unless you trim history

Best practices

  • Keep system prompts concise
  • Trim long message histories
  • Use the cheapest model that reliably solves the task
  • Estimate tokens locally for high-volume workloads
  • Check usage.cost after live requests and calibrate your estimates
  • Leave context headroom for output tokens
  • Resize images before sending them
  • Avoid oversized tool definitions and schemas unless necessary

Next steps