Vision & Multimodal


SolRouter supports multimodal requests for models that can process more than plain text. Depending on the selected model, you can send:

  • text
  • images
  • files
  • audio
  • video

This page explains how multimodal input works, how to structure requests, how to choose the right model, and what to watch for in production.

Base URL

https://api.solrouter.io/ai

What “multimodal” means

A multimodal model can accept multiple input types in a single request.

Examples:

  • ask a model to describe an image
  • extract data from a PDF or invoice
  • analyze a chart screenshot
  • summarize a video clip
  • transcribe or reason about audio
  • combine text instructions with an attached image or file

In SolRouter, multimodal requests use the same chat completions API as text-only requests. The main difference is that the content field of a message can become an array of typed input blocks instead of a single string.


Supported input types

The exact capabilities depend on the selected model, but the common multimodal input categories are:

  • text: plain text instructions or conversation history (chat, extraction, summarization)
  • image_url: a remote image URL or data URL (OCR, screenshot analysis, chart explanation)
  • file: structured or unstructured document input, where supported (invoices, PDFs, reports)
  • input_audio: audio input, where supported (transcription, summarization, voice analysis)
  • video: video input on supported models (scene understanding, clip summarization)

To see whether a specific model supports image, file, audio, or video input, check the Models catalogue or the Available Models documentation.
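As a sketch of how such a capability check might look in application code, the helper below filters a models listing by input modality. The `input_modalities` field and the catalogue shape are assumptions for illustration; check the actual Models catalogue schema for the real field names.

```python
# Sketch: filter a models listing by supported input modality.
# The "input_modalities" field is a hypothetical shape, not a confirmed
# SolRouter schema -- verify against the Models catalogue.

def models_supporting(models: list[dict], modality: str) -> list[str]:
    """Return the IDs of models whose listing advertises the given input modality."""
    return [
        m["id"]
        for m in models
        if modality in m.get("input_modalities", [])
    ]

catalogue = [
    {"id": "openai/gpt-4o", "input_modalities": ["text", "image"]},
    {"id": "some/text-only-model", "input_modalities": ["text"]},
    {"id": "google/gemini-2.5-pro", "input_modalities": ["text", "image", "video"]},
]

print(models_supporting(catalogue, "image"))
# ['openai/gpt-4o', 'google/gemini-2.5-pro']
```

Doing this check up front lets you fail fast before sending media to a model that will reject it.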


Text-only vs multimodal messages

A plain text message looks like this:

{
  "role": "user",
  "content": "Summarise this document."
}

A multimodal message uses an array of content blocks:

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What is shown in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/chart.png"
      }
    }
  ]
}

This lets you combine instructions and media in a single request.


Image input

Image input is the most common multimodal workflow. You can send:

  • a public remote image URL
  • a signed temporary URL
  • a data: URI with base64-encoded image data

Example request with a remote image

{
  "model": "openai/gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe the chart and identify the overall trend."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/chart.png"
          }
        }
      ]
    }
  ]
}

Example request with curl

curl https://api.solrouter.io/ai/chat/completions \
  -H "Authorization: Bearer $SOLROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/photo.jpg"
            }
          }
        ]
      }
    ]
  }'

TypeScript example

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.solrouter.io/ai",
  apiKey: process.env.SOLROUTER_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "What is happening in this image?",
        },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/photo.jpg",
          },
        },
      ],
    },
  ],
});

console.log(completion.choices[0].message.content);

Python example

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.solrouter.io/ai",
    api_key=os.environ["SOLROUTER_API_KEY"],
)

completion = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is happening in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/photo.jpg"
                    }
                }
            ]
        }
    ]
)

print(completion.choices[0].message.content)

image_url object

The most common image block shape is:

{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/image.png"
  }
}

Common fields

  • type (string, required): must be image_url
  • image_url.url (string, required): a public or signed image URL, or a data URI
  • image_url.detail (string, optional): a detail hint such as low, high, or auto, where supported

Example with a detail hint

{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/receipt.jpg",
    "detail": "high"
  }
}

Use a higher detail mode when you need:

  • OCR
  • small text extraction
  • invoice parsing
  • dense charts
  • UI screenshots with many labels

Use lower detail when you only need:

  • coarse scene understanding
  • simple object recognition
  • quick classification
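The detail guidance above can be encoded as a small helper. The task names here are illustrative, and the availability of low, high, and auto depends on the selected model.

```python
# Sketch: map a task type to an image "detail" hint.
# Task names are illustrative; low/high/auto support varies per model.

HIGH_DETAIL_TASKS = {"ocr", "small_text", "invoice", "dense_chart", "ui_screenshot"}
LOW_DETAIL_TASKS = {"scene", "object_recognition", "classification"}

def detail_for(task: str) -> str:
    if task in HIGH_DETAIL_TASKS:
        return "high"
    if task in LOW_DETAIL_TASKS:
        return "low"
    return "auto"  # let the provider decide when unsure

print(detail_for("ocr"))             # high
print(detail_for("classification"))  # low
```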

Data URL images

If your image is local or user-uploaded, you can embed it as a data: URL.

TypeScript example

import fs from "node:fs";

const bytes = fs.readFileSync("receipt.jpg");
const base64 = bytes.toString("base64");
const dataUrl = `data:image/jpeg;base64,${base64}`;

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Extract the invoice total and invoice number.",
        },
        {
          type: "image_url",
          image_url: {
            url: dataUrl,
          },
        },
      ],
    },
  ],
});

Python example

import base64
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.solrouter.io/ai",
    api_key=os.environ["SOLROUTER_API_KEY"],
)

with open("receipt.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

data_url = f"data:image/jpeg;base64,{encoded}"

completion = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract the invoice total and invoice number."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": data_url
                    }
                }
            ]
        }
    ]
)

print(completion.choices[0].message.content)

When to use data URLs

Use data URLs when:

  • the file is local
  • the image is private
  • you do not want to expose a public URL
  • you are handling user uploads directly in your backend

Avoid data URLs when the media is very large, because request payloads can become heavy and increase latency.
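One way to apply this trade-off in code is to inline small images as data URLs and fall back to a hosted or signed URL above a size threshold. The 1 MiB cutoff and the `upload_fn` hook are illustrative assumptions, not SolRouter limits.

```python
# Sketch: embed small local images as data URLs, but fall back to an
# uploaded/hosted URL above a size threshold to keep request payloads light.
# MAX_INLINE_BYTES and upload_fn are illustrative, not SolRouter limits.
import base64

MAX_INLINE_BYTES = 1 * 1024 * 1024  # illustrative 1 MiB threshold

def image_block(data: bytes, mime: str, upload_fn) -> dict:
    if len(data) <= MAX_INLINE_BYTES:
        encoded = base64.b64encode(data).decode("ascii")
        url = f"data:{mime};base64,{encoded}"
    else:
        url = upload_fn(data)  # e.g. returns a short-lived signed URL
    return {"type": "image_url", "image_url": {"url": url}}

small = image_block(b"\x89PNG...", "image/png", upload_fn=lambda d: "unused")
print(small["image_url"]["url"][:22])  # data:image/png;base64,
```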


File input

Some models support direct file or document-style input. This is useful for:

  • PDFs
  • reports
  • invoices
  • technical documents
  • policy documents
  • long-form context

The exact file block format depends on provider capabilities, but the common pattern is a typed content block alongside a text instruction.

Example conceptual request

{
  "model": "anthropic/claude-sonnet-4",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Extract all invoice line items from this file."
        },
        {
          "type": "file",
          "file": {
            "url": "https://example.com/invoice.pdf"
          }
        }
      ]
    }
  ]
}

Practical guidance

For file-heavy workflows:

  • use models that explicitly support file input
  • prefer signed or short-lived URLs for private documents
  • keep file size reasonable
  • consider splitting large documents into smaller chunks when possible
  • use structured output to make extraction reliable

If a model does not support file input directly, convert the content to another supported representation, such as:

  • extracted text
  • page images
  • OCR output plus original prompt
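The chunking advice above can be sketched as a simple page-based splitter, with a small overlap so context that spans a boundary is not lost. The chunk size and overlap values are illustrative.

```python
# Sketch: naive page-based chunking for a long document, so each model
# request stays within a manageable context size. Overlap keeps context
# that spans chunk boundaries. Assumes pages_per_chunk > overlap.

def chunk_pages(pages: list[str], pages_per_chunk: int = 10, overlap: int = 1) -> list[list[str]]:
    step = pages_per_chunk - overlap
    chunks = []
    for start in range(0, len(pages), step):
        chunks.append(pages[start:start + pages_per_chunk])
        if start + pages_per_chunk >= len(pages):
            break
    return chunks

pages = [f"page {i}" for i in range(1, 24)]  # 23 pages
chunks = chunk_pages(pages, pages_per_chunk=10, overlap=1)
print([len(c) for c in chunks])  # [10, 10, 5]
```

Each chunk can then be sent as its own request, with the extracted results merged afterwards.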

Audio input

Audio-capable models can process voice or other sound input alongside text instructions.

Typical use cases:

  • transcription
  • meeting summary
  • call notes extraction
  • audio classification
  • voice assistant input

Conceptual audio request

{
  "model": "openai/gpt-4o-audio-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Transcribe this audio and summarise the main points."
        },
        {
          "type": "input_audio",
          "input_audio": {
            "data": "BASE64_AUDIO_DATA",
            "format": "wav"
          }
        }
      ]
    }
  ]
}

Best practices for audio

  • use the exact format supported by the selected model
  • keep clips reasonably short unless you are using a long-context audio model
  • avoid sending low-quality or noisy recordings if accurate transcription matters
  • consider pre-processing speech if your application handles uploads directly

If your chosen model does not support audio input, use a transcription step first, then send the transcript as text.
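A small helper can build the input_audio block shown in the conceptual request above from raw audio bytes. Which formats are accepted depends on the model.

```python
# Sketch: wrap raw audio bytes as an input_audio content block, matching
# the conceptual request above. Accepted formats vary per model.
import base64

def audio_block(audio_bytes: bytes, fmt: str = "wav") -> dict:
    """Base64-encode raw audio and wrap it as an input_audio content block."""
    data = base64.b64encode(audio_bytes).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}

block = audio_block(b"RIFF....WAVE", fmt="wav")
print(block["type"], block["input_audio"]["format"])  # input_audio wav
```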


Video input

Video-capable models can analyze motion, scenes, and other temporal content.

Typical use cases:

  • summarizing short clips
  • extracting scene-level details
  • detecting workflow steps
  • interpreting screen recordings
  • reasoning about a sequence of frames

Conceptual video request

{
  "model": "google/gemini-2.5-pro",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Summarise this clip and identify the main action."
        },
        {
          "type": "video",
          "video": {
            "url": "https://example.com/demo.mp4"
          }
        }
      ]
    }
  ]
}

Best practices for video

  • use video-capable models only when necessary
  • keep clips short for lower cost and faster responses
  • if possible, trim to the relevant section before upload
  • for UI or tutorial video analysis, screen captures or key frames may be sufficient
  • for long videos, consider extracting representative frames or summaries first
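For the last point, one simple approach is to sample evenly spaced timestamps and extract a frame at each one. The sketch below only computes the timestamps; the actual frame extraction (for example with ffmpeg) is out of scope here.

```python
# Sketch: pick evenly spaced timestamps for extracting representative
# frames from a clip. Frame extraction itself (e.g. via ffmpeg) is not
# shown; this only decides where to sample.

def sample_timestamps(duration_s: float, max_frames: int = 8) -> list[float]:
    if duration_s <= 0 or max_frames <= 0:
        return []
    step = duration_s / max_frames
    # Sample at the midpoint of each equal-length segment.
    return [round(step * i + step / 2, 2) for i in range(max_frames)]

print(sample_timestamps(60.0, max_frames=4))  # [7.5, 22.5, 37.5, 52.5]
```

The extracted frames can then be sent as ordinary image_url blocks to an image-capable model.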

Combining text with media

The strongest multimodal results usually come from pairing media with clear instructions.

Instead of asking:

What is this?

Prefer:

Read this invoice image and return:
- invoice number
- invoice date
- subtotal
- tax
- total
- currency

Or:

Look at this product screenshot and identify:
- product name
- visible price
- branding
- likely category

The media gives the model evidence; the text tells it what to do with that evidence.

Example: image + structured output

{
  "model": "openai/gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Extract the invoice number, total, and currency from this image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/invoice.jpg",
            "detail": "high"
          }
        }
      ]
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "invoice_extraction",
      "schema": {
        "type": "object",
        "properties": {
          "invoice_number": { "type": "string" },
          "total": { "type": "number" },
          "currency": { "type": "string" }
        },
        "required": ["invoice_number", "total", "currency"],
        "additionalProperties": false
      }
    }
  }
}

This is one of the most reliable production patterns for document extraction.


Token usage and cost

Multimodal input increases prompt usage.

Your usage object may include:

  • prompt_tokens
  • completion_tokens
  • total_tokens
  • cost

Images, files, audio, and video are typically counted as part of prompt_tokens, but the exact accounting depends on the selected model and provider.

Example response fragment

{
  "usage": {
    "prompt_tokens": 441,
    "completion_tokens": 96,
    "total_tokens": 537,
    "cost": 0.00091
  }
}

Practical cost advice

  • high-resolution images usually cost more than small ones
  • multiple images increase prompt usage
  • large schemas plus multimodal input can grow token counts quickly
  • detailed OCR and extraction tasks usually cost more than simple captioning
  • long audio or video analysis can become expensive faster than text-only requests

For more on token accounting, see Token Counting.
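To calibrate against real traffic, you can accumulate the usage objects from your responses. This assumes each response carries a usage object shaped like the fragment above.

```python
# Sketch: accumulate usage across multimodal responses to measure real
# per-request cost. Assumes usage objects shaped like the fragment above.

def summarise_usage(usages: list[dict]) -> dict:
    total = {"prompt_tokens": 0, "completion_tokens": 0, "cost": 0.0}
    for u in usages:
        total["prompt_tokens"] += u.get("prompt_tokens", 0)
        total["completion_tokens"] += u.get("completion_tokens", 0)
        total["cost"] += u.get("cost", 0.0)
    total["avg_cost"] = total["cost"] / len(usages) if usages else 0.0
    return total

stats = summarise_usage([
    {"prompt_tokens": 441, "completion_tokens": 96, "cost": 0.00091},
    {"prompt_tokens": 380, "completion_tokens": 80, "cost": 0.00080},
])
print(stats["prompt_tokens"], round(stats["avg_cost"], 6))  # 821 0.000855
```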


Full TypeScript extraction example

import OpenAI from "openai";
import { z } from "zod";

const InvoiceSchema = z.object({
  invoice_number: z.string(),
  total: z.number(),
  currency: z.string(),
});

const client = new OpenAI({
  baseURL: "https://api.solrouter.io/ai",
  apiKey: process.env.SOLROUTER_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Extract the invoice number, total, and currency from this image.",
        },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/invoice.jpg",
            detail: "high",
          },
        },
      ],
    },
  ],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "invoice_extraction",
      schema: {
        type: "object",
        properties: {
          invoice_number: { type: "string" },
          total: { type: "number" },
          currency: { type: "string" },
        },
        required: ["invoice_number", "total", "currency"],
        additionalProperties: false,
      },
    },
  },
});

const raw = completion.choices[0].message.content ?? "{}";
const parsed = InvoiceSchema.parse(JSON.parse(raw));

console.log(parsed);
console.log(completion.usage);

Full Python extraction example

from openai import OpenAI
from pydantic import BaseModel
import json
import os

class Invoice(BaseModel):
    invoice_number: str
    total: float
    currency: str

client = OpenAI(
    base_url="https://api.solrouter.io/ai",
    api_key=os.environ["SOLROUTER_API_KEY"],
)

completion = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract the invoice number, total, and currency from this image."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/invoice.jpg",
                        "detail": "high"
                    }
                }
            ]
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice_extraction",
            "schema": {
                "type": "object",
                "properties": {
                    "invoice_number": {"type": "string"},
                    "total": {"type": "number"},
                    "currency": {"type": "string"}
                },
                "required": ["invoice_number", "total", "currency"],
                "additionalProperties": False
            }
        }
    }
)

raw = completion.choices[0].message.content or "{}"
parsed = Invoice.model_validate(json.loads(raw))

print(parsed)
print(completion.usage)

Common mistakes

1. Using a text-only model for image input

Not every model can process media. Always verify modality support first.

2. Sending vague instructions

“Describe this” is often much less useful than a task-specific prompt.

3. Ignoring image quality

Low-resolution or blurry images reduce OCR and extraction accuracy.

4. Overloading one request with too much media

Multiple large images, files, and schemas in one request can increase both latency and cost.

5. Skipping output validation

If you use multimodal extraction in production, validate the parsed result with Zod, Pydantic, or equivalent runtime validation.

6. Treating signed URLs as permanent

If you use short-lived or signed URLs, ensure they remain valid long enough for the request to complete.

7. Forgetting privacy implications

Do not send sensitive customer media to a model unless your application, policies, and provider choices allow it.


Best practices

Choose the right model

Use a model that explicitly supports the media type you need.

Give precise instructions

Tell the model exactly what to extract, describe, classify, or summarize.

Prefer structured output for extraction

For invoices, forms, screenshots, and records, combine multimodal input with json_schema.

Keep media focused

Crop images, trim clips, and remove irrelevant pages where possible.

Validate everything

Treat multimodal extraction output like any other external input.

Measure cost with real requests

Inspect usage.cost and calibrate based on actual traffic.

Use signed URLs for private media

Avoid exposing sensitive documents or images publicly when a signed URL or backend-hosted file flow is possible.


Security considerations

Multimodal requests often include user-generated media, which can be sensitive.

Be careful with:

  • receipts
  • invoices
  • identification documents
  • customer screenshots
  • internal dashboards
  • meeting recordings
  • support attachments

Security recommendations

  • avoid public long-lived URLs for sensitive media
  • use short-lived signed URLs when possible
  • validate uploaded file types before forwarding them
  • scrub or redact sensitive content if required by policy
  • log metadata, not raw media, unless you truly need the media for debugging
  • keep your API key server-side for production integrations

For broader credential guidance, see Security Best Practices.


When to use multimodal vs preprocessing

Multimodal input is powerful, but sometimes preprocessing is the better choice.

Prefer multimodal directly when:

  • the model needs visual layout
  • formatting matters
  • OCR quality must remain high
  • screenshots or charts are central to the task
  • audio or video context matters

Prefer preprocessing when:

  • you only need plain text content
  • you already have OCR or transcription available
  • you need lower cost
  • you want deterministic preprocessing before model inference

Examples

  • screenshot analysis: multimodal
  • invoice OCR with layout-sensitive fields: multimodal
  • summarizing a clean transcript: text after preprocessing
  • classifying support emails with attachments: depends on whether attachment layout matters
  • meeting summary from a recorded call: transcription first, unless audio nuance matters

Next steps