Vision & Multimodal


SolRouter supports multimodal requests for models that can process more than plain text. Depending on the selected model, you can send:

  • text
  • images
  • files
  • audio
  • video

This page explains how multimodal input works, how to structure requests, how to choose the right model, and what to watch for in production.

Base URL

https://api.solrouter.io/ai

What “multimodal” means

A multimodal model can accept multiple input types in a single request.

Examples:

  • ask a model to describe an image
  • extract data from a PDF or invoice
  • analyze a chart screenshot
  • summarize a video clip
  • transcribe or reason about audio
  • combine text instructions with an attached image or file

In SolRouter, multimodal requests use the same chat completions API as text-only requests. The main difference is that the content field of a message can become an array of typed input blocks instead of a single string.


Supported input types

The exact capabilities depend on the selected model, but the common multimodal input categories are:

  • text: plain text instructions or conversation history (chat, extraction, summarization)
  • image_url: a remote image URL or data URL (OCR, screenshot analysis, chart explanation)
  • file: structured or unstructured document input, where supported (invoices, PDFs, reports)
  • input_audio: audio input, where supported (transcription, summarization, voice analysis)
  • video: video input on supported models (scene understanding, clip summarization)

To see whether a specific model supports image, file, audio, or video input, check the Models catalogue or the Available Models documentation.
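As a sketch of how such a capability check might look in application code, the helper below filters a models listing by input modality. The `input_modalities` field and the catalogue shape are assumptions for illustration; check the actual Models catalogue schema for the real field names.

```python
# Sketch: filter a models listing by supported input modality.
# The "input_modalities" field is a hypothetical shape, not a confirmed
# SolRouter schema -- verify against the Models catalogue.

def models_supporting(models: list[dict], modality: str) -> list[str]:
    """Return the IDs of models whose listing advertises the given input modality."""
    return [
        m["id"]
        for m in models
        if modality in m.get("input_modalities", [])
    ]

catalogue = [
    {"id": "openai/gpt-4o", "input_modalities": ["text", "image"]},
    {"id": "some/text-only-model", "input_modalities": ["text"]},
    {"id": "google/gemini-2.5-pro", "input_modalities": ["text", "image", "video"]},
]

print(models_supporting(catalogue, "image"))
# ['openai/gpt-4o', 'google/gemini-2.5-pro']
```

Doing this check up front lets you fail fast before sending media to a model that will reject it.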


Text-only vs multimodal messages

A plain text message looks like this:

{
  "role": "user",
  "content": "Summarise this document."
}

A multimodal message uses an array of content blocks:

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What is shown in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/chart.png"
      }
    }
  ]
}

This lets you combine instructions and media in a single request.


Image input

Image input is the most common multimodal workflow. You can send:

  • a public remote image URL
  • a signed temporary URL
  • a data: URI with base64-encoded image data

Example request with a remote image

{
  "model": "openai/gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe the chart and identify the overall trend."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/chart.png"
          }
        }
      ]
    }
  ]
}

Example request with curl

curl https://api.solrouter.io/ai/chat/completions \
  -H "Authorization: Bearer $SOLROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/photo.jpg"
            }
          }
        ]
      }
    ]
  }'

TypeScript example

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.solrouter.io/ai",
  apiKey: process.env.SOLROUTER_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "What is happening in this image?",
        },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/photo.jpg",
          },
        },
      ],
    },
  ],
});

console.log(completion.choices[0].message.content);

Python example

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.solrouter.io/ai",
    api_key=os.environ["SOLROUTER_API_KEY"],
)

completion = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is happening in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/photo.jpg"
                    }
                }
            ]
        }
    ]
)

print(completion.choices[0].message.content)

image_url object

The most common image block shape is:

{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/image.png"
  }
}

Common fields

  • type (string, required): must be image_url
  • image_url.url (string, required): a public or signed image URL, or a data URI
  • image_url.detail (string, optional): a detail hint such as low, high, or auto, where supported

Example with a detail hint

{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/receipt.jpg",
    "detail": "high"
  }
}

Use a higher detail mode when you need:

  • OCR
  • small text extraction
  • invoice parsing
  • dense charts
  • UI screenshots with many labels

Use lower detail when you only need:

  • coarse scene understanding
  • simple object recognition
  • quick classification
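The detail guidance above can be encoded as a small helper. The task names here are illustrative, and the availability of low, high, and auto depends on the selected model.

```python
# Sketch: map a task type to an image "detail" hint.
# Task names are illustrative; low/high/auto support varies per model.

HIGH_DETAIL_TASKS = {"ocr", "small_text", "invoice", "dense_chart", "ui_screenshot"}
LOW_DETAIL_TASKS = {"scene", "object_recognition", "classification"}

def detail_for(task: str) -> str:
    if task in HIGH_DETAIL_TASKS:
        return "high"
    if task in LOW_DETAIL_TASKS:
        return "low"
    return "auto"  # let the provider decide when unsure

print(detail_for("ocr"))             # high
print(detail_for("classification"))  # low
```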

Data URL images

If your image is local or user-uploaded, you can embed it as a data: URL.

TypeScript example

import fs from "node:fs";

const bytes = fs.readFileSync("receipt.jpg");
const base64 = bytes.toString("base64");
const dataUrl = `data:image/jpeg;base64,${base64}`;

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Extract the invoice total and invoice number.",
        },
        {
          type: "image_url",
          image_url: {
            url: dataUrl,
          },
        },
      ],
    },
  ],
});

Python example

import base64
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.solrouter.io/ai",
    api_key=os.environ["SOLROUTER_API_KEY"],
)

with open("receipt.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

data_url = f"data:image/jpeg;base64,{encoded}"

completion = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract the invoice total and invoice number."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": data_url
                    }
                }
            ]
        }
    ]
)

print(completion.choices[0].message.content)

When to use data URLs

Use data URLs when:

  • the file is local
  • the image is private
  • you do not want to expose a public URL
  • you are handling user uploads directly in your backend

Avoid data URLs when the media is very large, because request payloads can become heavy and increase latency.
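One way to apply this trade-off in code is to inline small images as data URLs and fall back to a hosted or signed URL above a size threshold. The 1 MiB cutoff and the `upload_fn` hook are illustrative assumptions, not SolRouter limits.

```python
# Sketch: embed small local images as data URLs, but fall back to an
# uploaded/hosted URL above a size threshold to keep request payloads light.
# MAX_INLINE_BYTES and upload_fn are illustrative, not SolRouter limits.
import base64

MAX_INLINE_BYTES = 1 * 1024 * 1024  # illustrative 1 MiB threshold

def image_block(data: bytes, mime: str, upload_fn) -> dict:
    if len(data) <= MAX_INLINE_BYTES:
        encoded = base64.b64encode(data).decode("ascii")
        url = f"data:{mime};base64,{encoded}"
    else:
        url = upload_fn(data)  # e.g. returns a short-lived signed URL
    return {"type": "image_url", "image_url": {"url": url}}

small = image_block(b"\x89PNG...", "image/png", upload_fn=lambda d: "unused")
print(small["image_url"]["url"][:22])  # data:image/png;base64,
```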


File input

Some models support direct file or document-style input. This is useful for:

  • PDFs
  • reports
  • invoices
  • technical documents
  • policy documents
  • long-form context

The exact file block format depends on provider capabilities, but the common pattern is a typed content block alongside a text instruction.

Example conceptual request

{
  "model": "anthropic/claude-sonnet-4",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Extract all invoice line items from this file."
        },
        {
          "type": "file",
          "file": {
            "url": "https://example.com/invoice.pdf"
          }
        }
      ]
    }
  ]
}

Practical guidance

For file-heavy workflows:

  • use models that explicitly support file input
  • prefer signed or short-lived URLs for private documents
  • keep file size reasonable
  • consider splitting large documents into smaller chunks when possible
  • use structured output to make extraction reliable

If a model does not support file input directly, convert the content to another supported representation, such as:

  • extracted text
  • page images
  • OCR output plus original prompt
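The chunking advice above can be sketched as a simple page-based splitter, with a small overlap so context that spans a boundary is not lost. The chunk size and overlap values are illustrative.

```python
# Sketch: naive page-based chunking for a long document, so each model
# request stays within a manageable context size. Overlap keeps context
# that spans chunk boundaries. Assumes pages_per_chunk > overlap.

def chunk_pages(pages: list[str], pages_per_chunk: int = 10, overlap: int = 1) -> list[list[str]]:
    step = pages_per_chunk - overlap
    chunks = []
    for start in range(0, len(pages), step):
        chunks.append(pages[start:start + pages_per_chunk])
        if start + pages_per_chunk >= len(pages):
            break
    return chunks

pages = [f"page {i}" for i in range(1, 24)]  # 23 pages
chunks = chunk_pages(pages, pages_per_chunk=10, overlap=1)
print([len(c) for c in chunks])  # [10, 10, 5]
```

Each chunk can then be sent as its own request, with the extracted results merged afterwards.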

Audio input

Audio-capable models can process voice or other sound input alongside text instructions.

Typical use cases:

  • transcription
  • meeting summary
  • call notes extraction
  • audio classification
  • voice assistant input

Conceptual audio request

{
  "model": "openai/gpt-4o-audio-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Transcribe this audio and summarise the main points."
        },
        {
          "type": "input_audio",
          "input_audio": {
            "data": "BASE64_AUDIO_DATA",
            "format": "wav"
          }
        }
      ]
    }
  ]
}

Best practices for audio

  • use the exact format supported by the selected model
  • keep clips reasonably short unless you are using a long-context audio model
  • avoid sending low-quality or noisy recordings if accurate transcription matters
  • consider pre-processing speech if your application handles uploads directly

If your chosen model does not support audio input, use a transcription step first, then send the transcript as text.
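A small helper can build the input_audio block shown in the conceptual request above from raw audio bytes. Which formats are accepted depends on the model.

```python
# Sketch: wrap raw audio bytes as an input_audio content block, matching
# the conceptual request above. Accepted formats vary per model.
import base64

def audio_block(audio_bytes: bytes, fmt: str = "wav") -> dict:
    """Base64-encode raw audio and wrap it as an input_audio content block."""
    data = base64.b64encode(audio_bytes).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}

block = audio_block(b"RIFF....WAVE", fmt="wav")
print(block["type"], block["input_audio"]["format"])  # input_audio wav
```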


Video input

Video-capable models can analyze motion, scenes, and other temporal content.

Typical use cases:

  • summarizing short clips
  • extracting scene-level details
  • detecting workflow steps
  • interpreting screen recordings
  • reasoning about a sequence of frames

Conceptual video request

{
  "model": "google/gemini-2.5-pro",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Summarise this clip and identify the main action."
        },
        {
          "type": "video",
          "video": {
            "url": "https://example.com/demo.mp4"
          }
        }
      ]
    }
  ]
}

Best practices for video

  • use video-capable models only when necessary
  • keep clips short for lower cost and faster responses
  • if possible, trim to the relevant section before upload
  • for UI or tutorial video analysis, screen captures or key frames may be sufficient
  • for long videos, consider extracting representative frames or summaries first
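For the last point, one simple approach is to sample evenly spaced timestamps and extract a frame at each one. The sketch below only computes the timestamps; the actual frame extraction (for example with ffmpeg) is out of scope here.

```python
# Sketch: pick evenly spaced timestamps for extracting representative
# frames from a clip. Frame extraction itself (e.g. via ffmpeg) is not
# shown; this only decides where to sample.

def sample_timestamps(duration_s: float, max_frames: int = 8) -> list[float]:
    if duration_s <= 0 or max_frames <= 0:
        return []
    step = duration_s / max_frames
    # Sample at the midpoint of each equal-length segment.
    return [round(step * i + step / 2, 2) for i in range(max_frames)]

print(sample_timestamps(60.0, max_frames=4))  # [7.5, 22.5, 37.5, 52.5]
```

The extracted frames can then be sent as ordinary image_url blocks to an image-capable model.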

Combining text with media

The strongest multimodal results usually come from pairing media with clear instructions.

Instead of asking:

What is this?

Prefer:

Read this invoice image and return:
- invoice number
- invoice date
- subtotal
- tax
- total
- currency

Or:

Look at this product screenshot and identify:
- product name
- visible price
- branding
- likely category

The media gives the model evidence; the text tells it what to do with that evidence.

Example: image + structured output

{
  "model": "openai/gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Extract the invoice number, total, and currency from this image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/invoice.jpg",
            "detail": "high"
          }
        }
      ]
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "invoice_extraction",
      "schema": {
        "type": "object",
        "properties": {
          "invoice_number": { "type": "string" },
          "total": { "type": "number" },
          "currency": { "type": "string" }
        },
        "required": ["invoice_number", "total", "currency"],
        "additionalProperties": false
      }
    }
  }
}

This is one of the most reliable production patterns for document extraction.


Token usage and cost

Multimodal input increases prompt usage.

Your usage object may include:

  • prompt_tokens
  • completion_tokens
  • total_tokens
  • cost

Images, files, audio, and video are typically counted as part of prompt_tokens, but the exact accounting depends on the selected model and provider.

Example response fragment

{
  "usage": {
    "prompt_tokens": 441,
    "completion_tokens": 96,
    "total_tokens": 537,
    "cost": 0.00091
  }
}

Practical cost advice

  • high-resolution images usually cost more than small ones
  • multiple images increase prompt usage
  • large schemas plus multimodal input can grow token counts quickly
  • detailed OCR and extraction tasks usually cost more than simple captioning
  • long audio or video analysis can become expensive faster than text-only requests

For more on token accounting, see Token Counting.
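To calibrate against real traffic, you can accumulate the usage objects from your responses. This assumes each response carries a usage object shaped like the fragment above.

```python
# Sketch: accumulate usage across multimodal responses to measure real
# per-request cost. Assumes usage objects shaped like the fragment above.

def summarise_usage(usages: list[dict]) -> dict:
    total = {"prompt_tokens": 0, "completion_tokens": 0, "cost": 0.0}
    for u in usages:
        total["prompt_tokens"] += u.get("prompt_tokens", 0)
        total["completion_tokens"] += u.get("completion_tokens", 0)
        total["cost"] += u.get("cost", 0.0)
    total["avg_cost"] = total["cost"] / len(usages) if usages else 0.0
    return total

stats = summarise_usage([
    {"prompt_tokens": 441, "completion_tokens": 96, "cost": 0.00091},
    {"prompt_tokens": 380, "completion_tokens": 80, "cost": 0.00080},
])
print(stats["prompt_tokens"], round(stats["avg_cost"], 6))  # 821 0.000855
```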


Full TypeScript extraction example

import OpenAI from "openai";
import { z } from "zod";

const InvoiceSchema = z.object({
  invoice_number: z.string(),
  total: z.number(),
  currency: z.string(),
});

const client = new OpenAI({
  baseURL: "https://api.solrouter.io/ai",
  apiKey: process.env.SOLROUTER_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Extract the invoice number, total, and currency from this image.",
        },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/invoice.jpg",
            detail: "high",
          },
        },
      ],
    },
  ],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "invoice_extraction",
      schema: {
        type: "object",
        properties: {
          invoice_number: { type: "string" },
          total: { type: "number" },
          currency: { type: "string" },
        },
        required: ["invoice_number", "total", "currency"],
        additionalProperties: false,
      },
    },
  },
});

const raw = completion.choices[0].message.content ?? "{}";
const parsed = InvoiceSchema.parse(JSON.parse(raw));

console.log(parsed);
console.log(completion.usage);

Full Python extraction example

from openai import OpenAI
from pydantic import BaseModel
import json
import os

class Invoice(BaseModel):
    invoice_number: str
    total: float
    currency: str

client = OpenAI(
    base_url="https://api.solrouter.io/ai",
    api_key=os.environ["SOLROUTER_API_KEY"],
)

completion = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract the invoice number, total, and currency from this image."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/invoice.jpg",
                        "detail": "high"
                    }
                }
            ]
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice_extraction",
            "schema": {
                "type": "object",
                "properties": {
                    "invoice_number": {"type": "string"},
                    "total": {"type": "number"},
                    "currency": {"type": "string"}
                },
                "required": ["invoice_number", "total", "currency"],
                "additionalProperties": False
            }
        }
    }
)

raw = completion.choices[0].message.content or "{}"
parsed = Invoice.model_validate(json.loads(raw))

print(parsed)
print(completion.usage)

Common mistakes

1. Using a text-only model for image input

Not every model can process media. Always verify modality support first.

2. Sending vague instructions

“Describe this” is often much less useful than a task-specific prompt.

3. Ignoring image quality

Low-resolution or blurry images reduce OCR and extraction accuracy.

4. Overloading one request with too much media

Multiple large images, files, and schemas in one request can increase both latency and cost.

5. Skipping output validation

If you use multimodal extraction in production, validate the parsed result with Zod, Pydantic, or equivalent runtime validation.

6. Treating signed URLs as permanent

If you use short-lived or signed URLs, ensure they remain valid long enough for the request to complete.

7. Forgetting privacy implications

Do not send sensitive customer media to a model unless your application, policies, and provider choices allow it.


Best practices

Choose the right model

Use a model that explicitly supports the media type you need.

Give precise instructions

Tell the model exactly what to extract, describe, classify, or summarize.

Prefer structured output for extraction

For invoices, forms, screenshots, and records, combine multimodal input with json_schema.

Keep media focused

Crop images, trim clips, and remove irrelevant pages where possible.

Validate everything

Treat multimodal extraction output like any other external input.

Measure cost with real requests

Inspect usage.cost and calibrate based on actual traffic.

Use signed URLs for private media

Avoid exposing sensitive documents or images publicly when a signed URL or backend-hosted file flow is possible.


Security considerations

Multimodal requests often include user-generated media, which can be sensitive.

Be careful with:

  • receipts
  • invoices
  • identification documents
  • customer screenshots
  • internal dashboards
  • meeting recordings
  • support attachments

Security recommendations

  • avoid public long-lived URLs for sensitive media
  • use short-lived signed URLs when possible
  • validate uploaded file types before forwarding them
  • scrub or redact sensitive content if required by policy
  • log metadata, not raw media, unless you truly need the media for debugging
  • keep your API key server-side for production integrations

For broader credential guidance, see Security Best Practices.


When to use multimodal vs preprocessing

Multimodal input is powerful, but sometimes preprocessing is the better choice.

Prefer multimodal directly when:

  • the model needs visual layout
  • formatting matters
  • OCR quality must remain high
  • screenshots or charts are central to the task
  • audio or video context matters

Prefer preprocessing when:

  • you only need plain text content
  • you already have OCR or transcription available
  • you need lower cost
  • you want deterministic preprocessing before model inference

Examples

  • screenshot analysis: multimodal
  • invoice OCR with layout-sensitive fields: multimodal
  • summarizing a clean transcript: text after preprocessing
  • classifying support emails with attachments: depends on whether attachment layout matters
  • meeting summary from a recorded call: transcription first, unless audio nuance matters

Next steps