Vision & Multimodal
SolRouter supports multimodal requests for models that can process more than plain text. Depending on the selected model, you can send:
- text
- images
- files
- audio
- video
This page explains how multimodal input works, how to structure requests, how to choose the right model, and what to watch for in production.
Base URL
https://api.solrouter.io/ai
What “multimodal” means
A multimodal model can accept multiple input types in a single request.
Examples:
- ask a model to describe an image
- extract data from a PDF or invoice
- analyze a chart screenshot
- summarize a video clip
- transcribe or reason about audio
- combine text instructions with an attached image or file
In SolRouter, multimodal requests use the same chat completions API as text-only requests. The main difference is that the content field of a message can become an array of typed input blocks instead of a single string.
Supported input types
The exact capabilities depend on the selected model, but the common multimodal input categories are:
| Input type | Description | Example use cases |
|---|---|---|
| text | Plain text instructions or conversation history | chat, extraction, summarization |
| image_url | Remote image URL or data URL | OCR, screenshot analysis, chart explanation |
| file | Structured or unstructured document input, where supported | invoices, PDFs, reports |
| input_audio | Audio input, where supported | transcription, summarization, voice analysis |
| video | Video input on supported models | scene understanding, clip summarization |
To see whether a specific model supports image, file, audio, or video input, check the Models catalogue or the Available Models documentation.
Text-only vs multimodal messages
A plain text message looks like this:
{
"role": "user",
"content": "Summarise this document."
}
A multimodal message uses an array of content blocks:
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is shown in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/chart.png"
}
}
]
}
This lets you combine instructions and media in a single request.
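The content-block pattern above can be wrapped in a small helper that assembles a multimodal user message. This is a sketch only: the helper name and signature are illustrative, not part of any SolRouter SDK, but the block shapes match the examples on this page.

```python
# Sketch: build a user message whose content mixes text and image blocks.
# Helper name and signature are illustrative, not part of an official SDK.

def build_user_message(text, image_urls=None, detail=None):
    """Return a chat message combining a text block with image_url blocks."""
    blocks = [{"type": "text", "text": text}]
    for url in image_urls or []:
        image_url = {"url": url}
        if detail:
            image_url["detail"] = detail  # optional hint, where supported
        blocks.append({"type": "image_url", "image_url": image_url})
    return {"role": "user", "content": blocks}

message = build_user_message(
    "What is shown in this image?",
    image_urls=["https://example.com/chart.png"],
)
```

The returned dict can be placed directly into the messages array of a chat completions request.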
Image input
Image input is the most common multimodal workflow. You can send:
- a public remote image URL
- a signed temporary URL
- a data: URI with base64-encoded image data
Example request with a remote image
{
"model": "openai/gpt-4o",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the chart and identify the overall trend."
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/chart.png"
}
}
]
}
]
}
Example request with curl
curl https://api.solrouter.io/ai/chat/completions \
-H "Authorization: Bearer $SOLROUTER_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image."
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/photo.jpg"
}
}
]
}
]
}'
TypeScript example
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.solrouter.io/ai",
apiKey: process.env.SOLROUTER_API_KEY,
});
const completion = await client.chat.completions.create({
model: "openai/gpt-4o",
messages: [
{
role: "user",
content: [
{
type: "text",
text: "What is happening in this image?",
},
{
type: "image_url",
image_url: {
url: "https://example.com/photo.jpg",
},
},
],
},
],
});
console.log(completion.choices[0].message.content);
Python example
from openai import OpenAI
import os
client = OpenAI(
base_url="https://api.solrouter.io/ai",
api_key=os.environ["SOLROUTER_API_KEY"],
)
completion = client.chat.completions.create(
model="openai/gpt-4o",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is happening in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/photo.jpg"
}
}
]
}
]
)
print(completion.choices[0].message.content)
image_url object
The most common image block shape is:
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.png"
}
}
Common fields
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be image_url |
| image_url.url | string | Yes | Public or signed image URL, or a data URI |
| image_url.detail | string | No | Optional detail hint such as low, high, or auto, where supported |
Example with a detail hint
{
"type": "image_url",
"image_url": {
"url": "https://example.com/receipt.jpg",
"detail": "high"
}
}
Use a higher detail mode when you need:
- OCR
- small text extraction
- invoice parsing
- dense charts
- UI screenshots with many labels
Use lower detail when you only need:
- coarse scene understanding
- simple object recognition
- quick classification
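This guidance can be captured in a small heuristic that maps a task category to a detail hint. The task labels and the mapping are illustrative assumptions, not an official API, and detail support varies by model.

```python
# Sketch: pick an image detail hint from the task category.
# Task names and the mapping are illustrative starting heuristics.

HIGH_DETAIL_TASKS = {"ocr", "small_text", "invoice", "dense_chart", "ui_screenshot"}
LOW_DETAIL_TASKS = {"scene", "object_recognition", "classification"}

def detail_for_task(task):
    if task in HIGH_DETAIL_TASKS:
        return "high"
    if task in LOW_DETAIL_TASKS:
        return "low"
    return "auto"  # let the provider decide when the task is ambiguous
```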
Data URL images
If your image is local or user-uploaded, you can embed it as a data: URL.
TypeScript example
import fs from "node:fs";
const bytes = fs.readFileSync("receipt.jpg");
const base64 = bytes.toString("base64");
const dataUrl = `data:image/jpeg;base64,${base64}`;
const completion = await client.chat.completions.create({
model: "openai/gpt-4o",
messages: [
{
role: "user",
content: [
{
type: "text",
text: "Extract the invoice total and invoice number.",
},
{
type: "image_url",
image_url: {
url: dataUrl,
},
},
],
},
],
});
Python example
import base64
from openai import OpenAI
import os
client = OpenAI(
base_url="https://api.solrouter.io/ai",
api_key=os.environ["SOLROUTER_API_KEY"],
)
with open("receipt.jpg", "rb") as f:
encoded = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/jpeg;base64,{encoded}"
completion = client.chat.completions.create(
model="openai/gpt-4o",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Extract the invoice total and invoice number."
},
{
"type": "image_url",
"image_url": {
"url": data_url
}
}
]
}
]
)
print(completion.choices[0].message.content)
When to use data URLs
Use data URLs when:
- the file is local
- the image is private
- you do not want to expose a public URL
- you are handling user uploads directly in your backend
Avoid data URLs when the media is very large, because request payloads can become heavy and increase latency.
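A sketch of a guarded data URL builder follows. The 10 MB threshold is an arbitrary example for illustration, not a documented SolRouter limit; the point is that base64 inflates payloads by roughly a third, so check the final size before sending.

```python
import base64
import mimetypes

# Sketch: build a data: URL from local bytes, guarding against oversized
# payloads. MAX_DATA_URL_BYTES is an arbitrary example threshold.

MAX_DATA_URL_BYTES = 10 * 1024 * 1024

def to_data_url(filename, data):
    mime, _ = mimetypes.guess_type(filename)
    encoded = base64.b64encode(data).decode("ascii")
    url = f"data:{mime or 'application/octet-stream'};base64,{encoded}"
    # base64 inflates payloads by ~33%, so measure the final string
    if len(url) > MAX_DATA_URL_BYTES:
        raise ValueError("payload too large; prefer a signed URL instead")
    return url
```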
File input
Some models support direct file or document-style input. This is useful for:
- PDFs
- reports
- invoices
- technical documents
- policy documents
- long-form context
The exact file block format depends on provider capabilities, but the common pattern is a typed content block alongside a text instruction.
Example conceptual request
{
"model": "anthropic/claude-sonnet-4",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Extract all invoice line items from this file."
},
{
"type": "file",
"file": {
"url": "https://example.com/invoice.pdf"
}
}
]
}
]
}
Practical guidance
For file-heavy workflows:
- use models that explicitly support file input
- prefer signed or short-lived URLs for private documents
- keep file size reasonable
- consider splitting large documents into smaller chunks when possible
- use structured output to make extraction reliable
If a model does not support file input directly, convert the content to another supported representation, such as:
- extracted text
- page images
- OCR output plus original prompt
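One way to encode this fallback order is a small dispatch on the model's advertised modalities. The capability dict shape and strategy names below are assumptions for illustration, not the actual catalogue schema; check the Models catalogue for what a given model really supports.

```python
# Sketch: choose a document-input strategy from a model's modality flags.
# The caps dict shape is an illustrative assumption, not a real schema.

def choose_file_strategy(caps):
    if caps.get("file"):
        return "native_file"      # send the file block directly
    if caps.get("image"):
        return "page_images"      # render pages to images, preserving layout
    return "extracted_text"       # fall back to OCR or extracted text
```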
Audio input
Audio-capable models can process voice or other sound input.
Typical use cases:
- transcription
- meeting summary
- call notes extraction
- audio classification
- voice assistant input
Conceptual audio request
{
"model": "openai/gpt-4o-audio-preview",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Transcribe this audio and summarise the main points."
},
{
"type": "input_audio",
"input_audio": {
"data": "BASE64_AUDIO_DATA",
"format": "wav"
}
}
]
}
]
}
Best practices for audio
- use the exact format supported by the selected model
- keep clips reasonably short unless you are using a long-context audio model
- avoid sending low-quality or noisy recordings if accurate transcription matters
- consider pre-processing speech if your application handles uploads directly
If your chosen model does not support audio input, use a transcription step first, then send the transcript as text.
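Wrapping raw audio bytes into the input_audio block shape shown above can be sketched as follows; the helper name is illustrative, and supported formats vary by model.

```python
import base64

# Sketch: wrap raw audio bytes in an input_audio content block.
# The block shape matches the conceptual request above; the helper
# name is illustrative, and format support varies by model.

def audio_block(data, fmt="wav"):
    return {
        "type": "input_audio",
        "input_audio": {
            "data": base64.b64encode(data).decode("ascii"),
            "format": fmt,
        },
    }
```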
Video input
Video-capable models can analyze motion, scenes, or temporal content when supported.
Typical use cases:
- summarizing short clips
- extracting scene-level details
- detecting workflow steps
- interpreting screen recordings
- reasoning about a sequence of frames
Conceptual video request
{
"model": "google/gemini-2.5-pro",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Summarise this clip and identify the main action."
},
{
"type": "video",
"video": {
"url": "https://example.com/demo.mp4"
}
}
]
}
]
}
Best practices for video
- use video-capable models only when necessary
- keep clips short for lower cost and faster responses
- if possible, trim to the relevant section before upload
- for UI or tutorial video analysis, screen captures or key frames may be sufficient
- for long videos, consider extracting representative frames or summaries first
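Extracting representative frames is commonly done with ffmpeg's fps filter before sending the frames as images. The sketch below only constructs the command; it assumes ffmpeg is installed and would be run with subprocess.run(cmd, check=True).

```python
# Sketch: build an ffmpeg command that samples frames from a clip,
# so the frames can be sent as image inputs instead of raw video.
# Assumes ffmpeg is installed; this only constructs the argument list.

def frame_sample_cmd(video_path, out_pattern, fps=0.5):
    # fps=0.5 keeps one frame every two seconds; tune per clip length
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",
        out_pattern,  # e.g. "frames/frame_%03d.jpg"
    ]
```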
Combining text with media
The strongest multimodal results usually come from pairing media with clear instructions.
Instead of asking:
What is this?
Prefer:
Read this invoice image and return:
- invoice number
- invoice date
- subtotal
- tax
- total
- currency
Or:
Look at this product screenshot and identify:
- product name
- visible price
- branding
- likely category
The media gives the model evidence; the text tells it what to do with that evidence.
Example: image + structured output
{
"model": "openai/gpt-4o",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Extract the invoice number, total, and currency from this image."
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/invoice.jpg",
"detail": "high"
}
}
]
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "invoice_extraction",
"schema": {
"type": "object",
"properties": {
"invoice_number": { "type": "string" },
"total": { "type": "number" },
"currency": { "type": "string" }
},
"required": ["invoice_number", "total", "currency"],
"additionalProperties": false
}
}
}
}
This is one of the most reliable production patterns for document extraction.
Token usage and cost
Multimodal input increases prompt usage.
Your usage object may include:
- prompt_tokens
- completion_tokens
- total_tokens
- cost
Images, files, audio, and video are typically counted as part of prompt_tokens, but the exact accounting depends on the selected model and provider.
Example response fragment
{
"usage": {
"prompt_tokens": 441,
"completion_tokens": 96,
"total_tokens": 537,
"cost": 0.00091
}
}
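A usage object like the fragment above can be accumulated across requests to watch multimodal spend in aggregate. This is a minimal sketch; the field names mirror the fragment, and cost defaults to zero because some responses may omit it.

```python
# Sketch: accumulate usage across requests to track multimodal spend.
# Field names mirror the response fragment above; "cost" defaults to 0
# because some responses may omit it.

class UsageTracker:
    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.cost = 0.0

    def add(self, usage):
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)
        self.cost += usage.get("cost", 0.0)
```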
Practical cost advice
- high-resolution images usually cost more than small ones
- multiple images increase prompt usage
- large schemas plus multimodal input can grow token counts quickly
- detailed OCR and extraction tasks usually cost more than simple captioning
- long audio or video analysis can become expensive faster than text-only requests
For more on token accounting, see Token Counting.
Full TypeScript extraction example
import OpenAI from "openai";
import { z } from "zod";
const InvoiceSchema = z.object({
invoice_number: z.string(),
total: z.number(),
currency: z.string(),
});
const client = new OpenAI({
baseURL: "https://api.solrouter.io/ai",
apiKey: process.env.SOLROUTER_API_KEY,
});
const completion = await client.chat.completions.create({
model: "openai/gpt-4o",
messages: [
{
role: "user",
content: [
{
type: "text",
text: "Extract the invoice number, total, and currency from this image.",
},
{
type: "image_url",
image_url: {
url: "https://example.com/invoice.jpg",
detail: "high",
},
},
],
},
],
response_format: {
type: "json_schema",
json_schema: {
name: "invoice_extraction",
schema: {
type: "object",
properties: {
invoice_number: { type: "string" },
total: { type: "number" },
currency: { type: "string" },
},
required: ["invoice_number", "total", "currency"],
additionalProperties: false,
},
},
},
});
const raw = completion.choices[0].message.content ?? "{}";
const parsed = InvoiceSchema.parse(JSON.parse(raw));
console.log(parsed);
console.log(completion.usage);
Full Python extraction example
from openai import OpenAI
from pydantic import BaseModel
import json
import os
class Invoice(BaseModel):
invoice_number: str
total: float
currency: str
client = OpenAI(
base_url="https://api.solrouter.io/ai",
api_key=os.environ["SOLROUTER_API_KEY"],
)
completion = client.chat.completions.create(
model="openai/gpt-4o",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Extract the invoice number, total, and currency from this image."
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/invoice.jpg",
"detail": "high"
}
}
]
}
],
response_format={
"type": "json_schema",
"json_schema": {
"name": "invoice_extraction",
"schema": {
"type": "object",
"properties": {
"invoice_number": {"type": "string"},
"total": {"type": "number"},
"currency": {"type": "string"}
},
"required": ["invoice_number", "total", "currency"],
"additionalProperties": False
}
}
}
)
raw = completion.choices[0].message.content or "{}"
parsed = Invoice.model_validate(json.loads(raw))
print(parsed)
print(completion.usage)
Common mistakes
1. Using a text-only model for image input
Not every model can process media. Always verify modality support first.
2. Sending vague instructions
“Describe this” is often much less useful than a task-specific prompt.
3. Ignoring image quality
Low-resolution or blurry images reduce OCR and extraction accuracy.
4. Overloading one request with too much media
Multiple large images, files, and schemas in one request can increase both latency and cost.
5. Skipping output validation
If you use multimodal extraction in production, validate the parsed result with Zod, Pydantic, or equivalent runtime validation.
6. Treating signed URLs as permanent
If you use short-lived or signed URLs, ensure they remain valid long enough for the request to complete.
7. Forgetting privacy implications
Do not send sensitive customer media to a model unless your application, policies, and provider choices allow it.
Best practices
Choose the right model
Use a model that explicitly supports the media type you need.
Give precise instructions
Tell the model exactly what to extract, describe, classify, or summarize.
Prefer structured output for extraction
For invoices, forms, screenshots, and records, combine multimodal input with json_schema.
Keep media focused
Crop images, trim clips, and remove irrelevant pages where possible.
Validate everything
Treat multimodal extraction output like any other external input.
Measure cost with real requests
Inspect usage.cost and calibrate based on actual traffic.
Use signed URLs for private media
Avoid exposing sensitive documents or images publicly when a signed URL or backend-hosted file flow is possible.
Security considerations
Multimodal requests often include user-generated media, which can be sensitive.
Be careful with:
- receipts
- invoices
- identification documents
- customer screenshots
- internal dashboards
- meeting recordings
- support attachments
Security recommendations
- avoid public long-lived URLs for sensitive media
- use short-lived signed URLs when possible
- validate uploaded file types before forwarding them
- scrub or redact sensitive content if required by policy
- log metadata, not raw media, unless you truly need the media for debugging
- keep your API key server-side for production integrations
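Validating uploaded file types can be done by sniffing magic bytes rather than trusting the client-supplied filename or Content-Type header. The sketch below covers only a few common media signatures; extend the table for the formats your application accepts.

```python
from typing import Optional

# Sketch: verify an upload's magic bytes before forwarding it.
# The signature table covers a few common media types only.

SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"%PDF-": "application/pdf",
}

def sniff_media_type(data: bytes) -> Optional[str]:
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return None  # unknown type: reject rather than forward blindly
```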
For broader credential guidance, see Security Best Practices.
When to use multimodal vs preprocessing
Multimodal input is powerful, but sometimes preprocessing is the better choice.
Prefer multimodal directly when:
- the model needs visual layout
- formatting matters
- OCR quality must remain high
- screenshots or charts are central to the task
- audio or video context matters
Prefer preprocessing when:
- you only need plain text content
- you already have OCR or transcription available
- you need lower cost
- you want deterministic preprocessing before model inference
Examples
| Task | Better approach |
|---|---|
| screenshot analysis | multimodal |
| invoice OCR with layout-sensitive fields | multimodal |
| summarizing a clean transcript | text after preprocessing |
| classifying support emails with attachments | depends on whether attachment layout matters |
| meeting summary from recorded call | transcription first, unless audio nuance matters |
Next steps
- Available Models — browse modality support and model IDs
- Token Counting — understand multimodal token usage and cost
- Structured Output — return reliable JSON from extraction workflows
- Tool Calling — combine multimodal input with live application functions
- API Reference — request and response schema details