Token Counting
Every SolRouter request is billed according to the number of tokens processed by the selected model. Understanding how token counting works helps you estimate cost, avoid context window errors, and choose the right model for each workload.
This page explains what tokens are, how they are counted, how image and multimodal inputs affect usage, and how to estimate cost before you send a request.
What is a token?
A token is a chunk of text used internally by a language model. Tokens are not the same as words or characters:
- A short word may be one token
- A long word may be multiple tokens
- Spaces, punctuation, symbols, and newlines also count
- JSON, code, and structured text often produce more tokens than plain prose
As a rough rule of thumb for English text:
| Text size | Approximate token count |
|---|---|
| 1 short sentence | 10–25 tokens |
| 1 paragraph | 80–200 tokens |
| 1 page of text | 500–1,000 tokens |
| 1,000 English words | ~1,300 tokens |
These are only estimates. The exact token count depends on the model family and tokenizer.
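If you only need a ballpark figure before reaching for a real tokenizer, a common heuristic for English text is roughly 4 characters per token. A minimal sketch (the 4-characters rule is a heuristic, not any model's actual tokenizer):

```python
def rough_token_estimate(text: str) -> int:
    """Very rough token estimate using the ~4 characters/token heuristic for English."""
    return max(1, len(text) // 4)

print(rough_token_estimate("Explain what a context window is in LLMs."))  # ~10 tokens
```

For anything that affects billing or context limits, prefer a real tokenizer (see the estimation section below).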
How SolRouter reports usage
After a request completes, the response includes a usage object. This contains the token counts that were actually billed.
Example response fragment:
```json
{
  "usage": {
    "prompt_tokens": 312,
    "completion_tokens": 87,
    "total_tokens": 399,
    "cost": 0.0000148
  }
}
```
Usage fields
| Field | Meaning |
|---|---|
| prompt_tokens | Tokens in your input: system prompt, messages, tools, images, and other request metadata |
| completion_tokens | Tokens generated by the model in its response |
| total_tokens | prompt_tokens + completion_tokens |
| cost | Final billed cost in USD for that request |
cost is the most important field for billing. It reflects the actual amount deducted from your balance for the completed request.
Cost formula
The cost of a request is determined by the model’s input and output pricing.
Basic formula
```
cost = (prompt_tokens × input_price_per_token)
     + (completion_tokens × output_price_per_token)
```
Because pricing is usually shown per million tokens, it is often easier to think of it this way:
```
cost = (prompt_tokens / 1,000,000 × input_price_per_million)
     + (completion_tokens / 1,000,000 × output_price_per_million)
```
Example
Suppose you send a request to a model priced at:
- $3.00 / million input tokens
- $15.00 / million output tokens
And the request uses:
- prompt_tokens = 2,000
- completion_tokens = 500
Then:
```
input cost  = 2,000 / 1,000,000 × 3.00  = 0.006
output cost =   500 / 1,000,000 × 15.00 = 0.0075
total cost  = 0.006 + 0.0075            = 0.0135
```
So the request costs:
$0.0135
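The same arithmetic in code, using the pricing figures from the example above:

```python
input_price_per_million = 3.00
output_price_per_million = 15.00

prompt_tokens = 2_000
completion_tokens = 500

input_cost = prompt_tokens / 1_000_000 * input_price_per_million        # 0.006
output_cost = completion_tokens / 1_000_000 * output_price_per_million  # 0.0075
total = input_cost + output_cost

print(f"${total:.4f}")  # $0.0135
```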
What counts toward prompt_tokens
Many developers think only the visible message text is counted. In practice, prompt_tokens often includes more than that.
The following typically contribute to prompt usage:
- System prompts
- User messages
- Assistant messages from previous turns
- Tool definitions
- Function schemas
- JSON schemas used for structured output
- Image or multimodal input representations
- Internal formatting required by the model provider
This means two requests with the same visible prompt may still have different token counts if one includes:
- Long conversation history
- Large tool definitions
- Large JSON schemas
- Attached images or files
Example: chat request and token usage
Request
```js
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.solrouter.io/ai",
  apiKey: process.env.SOLROUTER_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4",
  messages: [
    {
      role: "system",
      content: "You are a concise technical assistant.",
    },
    {
      role: "user",
      content: "Explain what a context window is in LLMs.",
    },
  ],
});

console.log(completion.usage);
```
Possible usage output
```json
{
  "prompt_tokens": 42,
  "completion_tokens": 121,
  "total_tokens": 163,
  "cost": 0.000573
}
```
Even in this simple request, the system message and request formatting are included in prompt_tokens.
Conversation history and token growth
In multi-turn conversations, each new request typically includes some or all prior messages. This means prompt usage grows over time.
Example
Turn 1:
```json
[
  { "role": "user", "content": "Hello, who are you?" }
]
```
Turn 2:
```json
[
  { "role": "user", "content": "Hello, who are you?" },
  { "role": "assistant", "content": "I'm an AI assistant." },
  { "role": "user", "content": "Can you explain token counting?" }
]
```
By turn 2, the request includes:
- The original user message
- The previous assistant reply
- The new user message
So prompt_tokens is higher than it was on turn 1.
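You can make this growth visible by estimating the prompt size of each turn. A minimal sketch using the rough 4-characters-per-token heuristic (not a real tokenizer, so the numbers are only indicative):

```python
def rough_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

turn_1 = [{"role": "user", "content": "Hello, who are you?"}]
turn_2 = turn_1 + [
    {"role": "assistant", "content": "I'm an AI assistant."},
    {"role": "user", "content": "Can you explain token counting?"},
]

# Each turn's prompt includes all prior messages, so the estimate grows.
for name, messages in [("turn 1", turn_1), ("turn 2", turn_2)]:
    total = sum(rough_tokens(m["content"]) for m in messages)
    print(name, total)
```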
Why this matters
Long-running chats can become expensive and may eventually exceed the model’s context window. To control costs and latency:
- Trim old messages
- Summarise older context
- Use smaller models for routine turns
- Limit verbose system prompts
- Remove unnecessary tool definitions when not needed
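One simple trimming strategy is to keep the system prompt plus as many of the most recent messages as fit within a token budget. A hedged sketch (again using the 4-characters heuristic in place of a real tokenizer; a production version would also preserve tool-call pairs):

```python
def rough_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_history(messages, budget: int):
    """Keep the first (system) message, then as many recent messages as fit the budget."""
    system, rest = messages[0], messages[1:]
    kept, used = [], rough_tokens(system["content"])
    for msg in reversed(rest):  # walk from newest to oldest
        cost = rough_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))

history = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "First question about tokens and pricing."},
    {"role": "assistant", "content": "First answer, reasonably detailed."},
    {"role": "user", "content": "Follow-up question about context windows."},
]
trimmed = trim_history(history, budget=25)  # oldest user message is dropped
```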
Context window limits
Every model has a maximum context window. This is the total number of tokens the model can consider in one request.
That limit includes:
- Prompt tokens
- Completion tokens you ask the model to generate
If your prompt is too large, or if your prompt plus requested output exceeds the model’s limit, the request may fail.
Example
If a model supports a 128k context window:
- Your prompt and the generated output together may use up to about 128,000 tokens
- If you want the model to generate 4,000 tokens, your prompt must stay under roughly 124,000 tokens to leave room for that output
Typical failure case
You send:
- prompt_tokens ≈ 127,500
- max_tokens = 4,000
This exceeds the available context budget, so the request may be rejected.
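A simple preflight check catches this before the API call is made. A sketch, assuming you already have a local prompt-token estimate and know the model's context window:

```python
def fits_context(prompt_tokens: int, max_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the requested output fits in the model's context window."""
    return prompt_tokens + max_tokens <= context_window

print(fits_context(127_500, 4_000, 128_000))  # False: 131,500 > 128,000
print(fits_context(120_000, 4_000, 128_000))  # True
```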
Best practices
- Leave output headroom when setting max_tokens
- Trim older messages before retrying
- Prefer long-context models for documents, transcripts, and large codebases
Estimating tokens before sending a request
For applications that need budgeting, quota checks, or preflight validation, estimate tokens locally before making the API call.
Keep in mind:
- Estimates are useful
- Final billing is based on the provider’s actual tokenization and usage accounting
- Different model families may tokenize the same text differently
JavaScript with js-tiktoken
```js
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4o");

const text = "Explain the difference between tokens and words.";
const tokens = enc.encode(text);

console.log(tokens.length);
```
Estimating a full chat payload
```js
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4o");

const messages = [
  { role: "system", content: "You are a concise assistant." },
  { role: "user", content: "Summarise this text in 3 bullet points." },
];

const approximatePromptTokens = messages.reduce((sum, msg) => {
  return sum + enc.encode(msg.content).length;
}, 0);

console.log({ approximatePromptTokens });
```
This estimate will not perfectly match the billed token count, because chat formatting and provider-specific serialization are not fully represented.
Python with tiktoken
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

text = "Explain the difference between tokens and words."
tokens = enc.encode(text)

print(len(tokens))
```
Estimating message history in Python
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarise this text in 3 bullet points."},
]

approx_prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
print(approx_prompt_tokens)
```
Image tokens and multimodal inputs
For multimodal models, images are not free. Image inputs contribute to prompt_tokens, but the exact token cost depends on:
- The selected model
- Image dimensions
- How the provider internally resizes or tiles the image
- Whether the model supports low-detail vs high-detail processing
Important points
- A small image usually costs fewer tokens than a large one
- Multiple images increase prompt usage
- High-resolution images may significantly increase cost
- Different providers price image processing differently
Example request with image input
```js
const completion = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe this chart." },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/chart.png",
          },
        },
      ],
    },
  ],
});

console.log(completion.usage);
```
The returned prompt_tokens includes both:
- The text portion
- The image processing cost
Practical advice for image-heavy workloads
- Resize oversized images before upload
- Avoid sending multiple near-identical images
- Use cheaper multimodal models for simple OCR or captioning tasks
- Inspect usage.cost after a few sample requests before scaling up
Free models and token counting
Free models still produce token counts in usage, even when the request cost is zero.
Example:
```json
{
  "usage": {
    "prompt_tokens": 441,
    "completion_tokens": 96,
    "total_tokens": 537,
    "cost": 0
  }
}
```
This is useful because you can still measure:
- Prompt size
- Response length
- Relative efficiency
- Whether a workflow will fit within context limits
The only difference is that no paid credit is deducted for that request.
Tool calling and structured output increase prompt size
Features like tool calling and structured output are powerful, but they also add tokens.
Tool calling adds:
- Tool names
- Descriptions
- JSON parameter schemas
Structured output adds:
- JSON schema definitions
- Validation instructions
- Additional formatting constraints
Example with a tool definition:
```js
const completion = await client.chat.completions.create({
  model: "openai/gpt-4o-mini",
  messages: [
    { role: "user", content: "What's the weather in Berlin?" },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Fetches weather for a city",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string" },
          },
          required: ["city"],
        },
      },
    },
  ],
});
```
That schema contributes to prompt_tokens, even though the user never typed it.
If you define many tools or very large schemas, usage can grow quickly.
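One way to gauge how much a tool definition adds is to tokenize its JSON serialization. A rough sketch using the 4-characters-per-token heuristic (the provider's actual serialization and tokenizer will differ, so treat this as a lower-bound sanity check):

```python
import json

def rough_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Fetches weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Even this one-parameter schema adds tens of tokens before the user types anything.
print(rough_tokens(json.dumps(tool)))
```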
Prompt caching fields
Some models expose prompt caching-related pricing fields in the model metadata:
- PricingCacheRead
- PricingCacheWrite
These fields indicate that the provider may support reduced pricing for cached prompt segments.
What these mean
| Field | Meaning |
|---|---|
| PricingCacheWrite | Cost for writing reusable prompt content into cache |
| PricingCacheRead | Reduced cost when the model reuses cached prompt content |
Not all models support prompt caching, and the request format depends on the underlying provider’s capabilities.
When available, prompt caching can reduce cost for workloads that repeatedly reuse large prefixes such as:
- Long system prompts
- Large policy documents
- Repeated codebase context
- Reusable RAG context blocks
If a model does not expose cache pricing fields, assume standard prompt billing applies.
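When cache pricing is available, you can estimate the potential savings for a repeated prefix before relying on it. A hedged sketch (the prices and the 50k-token cached prefix below are illustrative assumptions, and this ignores the one-time cache-write cost):

```python
def cached_prompt_cost(cached_tokens: int, fresh_tokens: int,
                       input_per_million: float, cache_read_per_million: float) -> float:
    """Estimated input cost when `cached_tokens` of the prompt are served from cache."""
    return (cached_tokens / 1_000_000 * cache_read_per_million
            + fresh_tokens / 1_000_000 * input_per_million)

# Assumed: 50k-token system prompt cached at $0.30/M reads vs $3.00/M fresh input
with_cache = cached_prompt_cost(50_000, 2_000, 3.00, 0.30)
without_cache = (50_000 + 2_000) / 1_000_000 * 3.00

print(with_cache, without_cache)  # 0.021 vs 0.156
```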
Building a local cost estimator
A practical approach is to estimate tokens locally, then apply the model’s published pricing.
TypeScript example
```ts
type Pricing = {
  inputPerMillion: number;
  outputPerMillion: number;
};

function estimateCost(
  promptTokens: number,
  completionTokens: number,
  pricing: Pricing,
): number {
  const inputCost = (promptTokens / 1_000_000) * pricing.inputPerMillion;
  const outputCost = (completionTokens / 1_000_000) * pricing.outputPerMillion;
  return inputCost + outputCost;
}

const estimated = estimateCost(2500, 800, {
  inputPerMillion: 3.0,
  outputPerMillion: 15.0,
});

console.log(estimated);
```
Python example
```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_per_million: float,
    output_per_million: float,
) -> float:
    input_cost = (prompt_tokens / 1_000_000) * input_per_million
    output_cost = (completion_tokens / 1_000_000) * output_per_million
    return input_cost + output_cost

estimated = estimate_cost(2500, 800, 3.0, 15.0)
print(estimated)
```
This is useful for:
- Pre-request budgeting
- Internal quotas
- Cost previews in your UI
- Guardrails before expensive long-context jobs
Common mistakes
1. Counting only the latest user message
Wrong assumption:
- “My prompt is only 50 tokens”
Reality:
- The request may also include system prompts, message history, tools, and schemas
2. Ignoring output headroom
Wrong assumption:
- “The prompt fits in the model’s context window”
Reality:
- You also need room for the response
3. Underestimating image cost
Wrong assumption:
- “The image is just one attachment”
Reality:
- Images may consume substantial prompt budget depending on size and model
4. Assuming all models tokenize identically
Wrong assumption:
- “This estimate will be exact everywhere”
Reality:
- Different providers and model families may produce different token counts
5. Ignoring conversation growth
Wrong assumption:
- “Each turn costs about the same”
Reality:
- Multi-turn chats often get more expensive unless you trim history
Best practices
- Keep system prompts concise
- Trim long message histories
- Use the cheapest model that reliably solves the task
- Estimate tokens locally for high-volume workloads
- Check usage.cost after live requests and calibrate your estimates
- Leave context headroom for output tokens
- Resize images before sending them
- Avoid oversized tool definitions and schemas unless necessary
Next steps
- Available Models — browse model IDs, pricing, and modalities
- Reasoning Models — understand thinking models and when to use them
- First Request — make your first call and inspect the response
- Vision & Multimodal — learn how image and multimodal requests are structured
- Errors — troubleshoot context window and request validation failures