INFER / ENGINE v2.4
MULTIMODAL INFERENCE · COMPUTE

The inference engine.

For multimodal AI.

A bleeding-edge engine for optimized multimodal inference. Purpose-built for image, video, audio, and vision — the workloads defining the next decade of AI.

In partnership with NVIDIA
MULTIMODAL · IMAGE · VIDEO · VISION · AUDIO
OPTIMIZED FOR SPEED · QUALITY · COST
01 / TRY IT NOW
INTERACTIVE · LIVE INFERENCE
Generate in the app →
MODELS ON INFER
frontier & open-weights
Google · OpenAI · Black Forest Labs · ByteDance · Kuaishou · MiniMax · Alibaba · Ideogram · ElevenLabs · Reka
+ 100 MORE →
02 / EXPLORE MODELS

100+ models. One interface.

View all models →
nano-banana-2
Google's flagship T2I — fastest path to photoreal.
~1.4s · IMAGE

flux-1.1-pro
BFL's production T2I baseline. Broad style range.
~2.5s · IMAGE

ideogram-3.0
Best-in-class typography for posters & ads.
~3.5s · IMAGE

seedream-4.0
ByteDance's multi-reference image model.
~2.8s · IMAGE

kling-3.0-pro
Cinematic stylized video, 1080p, native 4K.
5–10s · VIDEO

seedance-2.0-pro
Director-grade video with native synced audio.
5–10s · VIDEO

sam-3.1
Segment anything in image and video. Open weights.
~0.4s · VISION

elevenlabs-v3
TTS with 70+ languages and inline emotion tags.
~1.2s · AUDIO
03 / THE INFERENCE ENGINE

Build with Infer.

One engine, every modality. Image, video, audio, vision — optimized for speed, quality, and cost. Same outputs as the source API, at lower latency and lower spend.

Image generation

Frontier T2I from Nano Banana 2, GPT Image, FLUX, Imagen, Seedream — same weights, same outputs, sub-2-second latency on most.

Image models →
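Under the hood this is a single REST call whose shape is shown in the curl example in section 07. A minimal standard-library sketch of building that request; the model id and API key below are placeholders:

```python
import json
import urllib.request

API_URL = "https://api.infer.sh/v1/run"

def build_run_request(model, prompt, api_key, **params):
    """Assemble a POST to /v1/run; body fields follow the curl
    example on this page (model, prompt, plus per-model params)."""
    body = {"model": model, "prompt": prompt, **params}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_run_request(
    "nano-banana-2",                 # any catalog model id
    "A futuristic city at sunset",
    "sk-...",                        # your API key
    width=1024,
    height=1024,
)
# urllib.request.urlopen(req) performs the call; the JSON response
# carries the output URL (result.url in the SDK examples).
```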

Video generation

Cinematic 5–10s clips with native audio. Kling 3.0 Pro, Veo 3.1, Seedance 2.0, Hailuo, Wan 2.2 — all on one queue, one bill.

Video models →
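A 5–10 second clip outlives a single request/response cycle, so the practical pattern is submit, then poll or take a webhook. A sketch of a polling loop with capped exponential backoff; the `status` field and its values here are hypothetical, not a documented response shape:

```python
import time

def backoff_schedule(base=0.5, cap=8.0, factor=2.0):
    """Yield capped-exponential poll intervals: 0.5, 1, 2, 4, 8, 8, ..."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)

def poll_until_done(fetch_status, delays, max_tries=60):
    """Call fetch_status() until the (hypothetical) job settles.
    fetch_status() returns a dict with a 'status' key."""
    for _attempt, delay in zip(range(max_tries), delays):
        job = fetch_status()
        if job["status"] in ("succeeded", "failed"):
            return job
        time.sleep(delay)
    raise TimeoutError("job did not settle in time")
```

Webhooks invert this: instead of polling, the engine calls you back when the clip is ready, which is the better fit for queue-heavy video workloads.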

Voice & audio

ElevenLabs v3 TTS with 70+ languages and inline emotion tags. Streaming output for IVR, podcasts, and real-time agents.

Audio models →
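Streaming output means consuming the response body as it arrives rather than waiting for the full file. A generic standard-library sketch (nothing Infer-specific; the actual TTS endpoint and headers would come from the API reference):

```python
import urllib.request

def stream_to_file(url, path, headers=None, chunk_size=8192):
    """Read a response body incrementally: chunks are written
    as they arrive instead of buffering the whole file in memory."""
    req = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        while chunk := resp.read(chunk_size):
            out.write(chunk)
```

For real-time agents and IVR you would feed chunks straight to a playback buffer instead of a file; the incremental-read loop is the same.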

Vision & segmentation

SAM 3.1 for object segmentation and tracking. Frontier VLMs for document, video, and multimodal understanding at production speed.

Vision models →
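This page doesn't specify the wire format SAM 3.1 masks come back in, but once decoded, applying a binary mask is plain per-pixel math. A dependency-free sketch with illustrative names:

```python
def apply_mask(pixels, mask, color=(255, 0, 0), alpha=0.5):
    """Blend `color` into RGB pixels wherever mask is truthy.
    pixels: rows of (r, g, b) tuples; mask: same shape, booleans."""
    out = []
    for row, mrow in zip(pixels, mask):
        out.append([
            tuple(int((1 - alpha) * p + alpha * c)
                  for p, c in zip(px, color))
            if m else px
            for px, m in zip(row, mrow)
        ])
    return out
```

In production you would do the same blend vectorized over arrays; the logic per masked pixel is identical.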

Edit & restoration

Instruction-driven image editing with FLUX.1 Kontext — restore, restyle, swap. Identity preservation across multi-step edits.

Editing models →

Custom & open weights

Bring your own LoRA or fine-tune; serve open-weights models at frontier-grade throughput. Wan, FLUX dev, Qwen Image — same SDK shape.

Open-weights →
04 / NVIDIA PARTNERSHIP

Backed by NVIDIA.
Built on the latest silicon.

We work hand-in-hand with NVIDIA on inference. That gets us early access to their newest accelerators and CUDA stack — and gets our researchers and engineers in the room with theirs to optimize Infer for whatever ships next. The work flows both ways.

Jensen Huang on stage at NVIDIA's CES 2025 keynote, with the AI evolution arc — Perception → Generative → Agentic → Physical AI.
CES 2025 KEYNOTE · LAS VEGAS
01 / EARLY ACCESS
Hardware as it ships. We get our hands on NVIDIA's newest accelerators ahead of general availability — and on the CUDA, TensorRT, and compiler advances that come with them. Infer runs on what's next, not what's in stock.

02 / JOINT ENGINEERING
Engineers in the same room. Our researchers and engineers work directly with NVIDIA's inference and kernel teams to optimize Infer for each new architecture. Bottlenecks get profiled jointly; fixes land upstream and downstream simultaneously.

03 / DEEP STACK
Optimized end to end. From kernel-level fusions to scheduler-aware batching to streaming protocol — every layer is tuned against NVIDIA hardware specifics. Generic inference servers can't match it because they don't have the relationship to know what to tune for.
05 / DEDICATED COMPUTE

Custom models. Dedicated GPUs.

FOR ENTERPRISE WORKLOADS

Bring your weights or fine-tune ours. Reserve capacity on the same hardware Infer runs on — Hopper today, Blackwell as it ships. Optimized stack, dedicated nodes per tenant, uptime measured against production SLAs.

H100 · H200 · NVL
AVAILABLE NOW

Hopper

141 GB HBM3e · 4.8 TB/s · NVLink

Frontier-grade throughput today. The H200 cluster runs the bulk of our serverless catalog; reserve dedicated nodes for hot LoRAs, regulated workloads, or 24/7 production loads.

MEMORY: 141 GB
BANDWIDTH: 4.8 TB/s
FP8 TFLOPS: 3,958
B200 · GB200 · NVL72
SHIPPING IN WAVES

Blackwell

192 GB HBM3e · 8 TB/s · 5th-gen NVLink

Next-generation compute as it lands at the foundry. First capacity is reserved for enterprise customers and frontier-model partners; talk to us before the queue closes for the next wave.

MEMORY: 192 GB
BANDWIDTH: 8 TB/s
FP4 TFLOPS: 20,000
CUSTOM MODELS
Bring your weights. LoRAs and full checkpoints, hosted on the same engine as the public catalog with the same SDK shape. Pay only for the calls the endpoint serves; no idle charges.

RESERVED CAPACITY
Pinned to your workload. Dedicated nodes per tenant — predictable latency, predictable cost, no shared-tenancy surprises. Scale up and down without re-provisioning.

UPTIME
99.95% on dedicated tiers. Measured against production SLAs. Incident credits in the standard contract; named engineer on call for enterprise tiers.

OPTIMIZED STACK
Tuned for the silicon you're on. Kernel-level fusions, scheduler-aware batching, and quantization paths re-tuned for each new architecture. Same stack we run on the serverless catalog.
06 / CHEAPER BY DESIGN

Same model. Lower bill.

Every model in the catalog is priced at least 20% under its source API. Same weights, same outputs, same SLA — the savings come from how Infer runs the model, not from cutting features.

MODEL                    MODALITY            VS SOURCE API
Nano Banana 2            Image               −22%
FLUX 1.1 [pro]           Image               −20%
Ideogram 3.0             Image · typography  −25%
Seedance 2.0 Pro         Video + audio       −20%
Kling 3.0 Pro            Video               −24%
Wan 2.2 (open weights)   Video · open        −23%
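The discount floor is mechanical: catalog price = source price × (1 − discount), with every discount at or above 0.20. A toy check using the table's percentages; the model ids and the $0.04 source price are illustrative, not published prices:

```python
# Discounts from the pricing table above (model ids illustrative)
CATALOG_DISCOUNTS = {
    "nano-banana-2": 0.22,
    "flux-1.1-pro": 0.20,
    "ideogram-3.0": 0.25,
    "seedance-2.0-pro": 0.20,
    "kling-3.0-pro": 0.24,
    "wan-2.2": 0.23,
}

def infer_price(source_price, model):
    """Catalog price: the source API price minus the listed discount."""
    return round(source_price * (1 - CATALOG_DISCOUNTS[model]), 4)

# "At least 20% under its source API" holds for every row
assert all(d >= 0.20 for d in CATALOG_DISCOUNTS.values())

# e.g. a hypothetical $0.04/image source price:
print(infer_price(0.04, "nano-banana-2"))  # 0.0312
```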
07 / DEVELOPER EXPERIENCE

Three lines to production.
// no setup required

Type-safe SDKs. OpenAPI spec. Streaming. Webhooks. Everything you need to ship fast — and nothing you don't.

example.py · STREAMING
import infer

# Initialize with your API key
client = infer.Client()

# Generate with any of 100+ models
result = client.run("flux-schnell", {
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024
})

# That's it. No infra, no queues.
print(result.url)
import Infer from "@infer/sdk";

// Initialize with your API key
const client = new Infer();

// Generate with any of 100+ models
const result = await client.run("flux-schnell", {
    prompt: "A futuristic city at sunset",
    width:  1024,
    height: 1024,
});

// Type-safe. Streaming-ready.
console.log(result.url);
package main

import (
    "fmt"

    infer "github.com/infer/sdk-go"
)

func main() {
    client := infer.NewClient()

    // Run any of 100+ models
    result, err := client.Run("flux-schnell", infer.Params{
        Prompt: "A futuristic city at sunset",
        Width:  1024,
        Height: 1024,
    })
    if err != nil {
        panic(err)
    }
    fmt.Println(result.URL)
}
# Works with any HTTP client
curl https://api.infer.sh/v1/run \
  -H "Authorization: Bearer $INFER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model":  "flux-schnell",
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024
  }'

# Response streams back over HTTP/2
# X-Infer-Latency: 1243ms
08 / ENTERPRISE

Built for enterprise.

Reserved capacity, private endpoints, compliance packages — available as an add-on for teams operating at serious scale. Not included in Pro; talk to us and we'll scope what you need.

TYPICAL ENTERPRISE ENGAGEMENT
  • 01 Scope & compliance review · ~1 wk
  • 02 Reserved capacity provisioning · ~3 days
  • 03 Integration & private endpoints · ~1 wk
  • 04 Go-live with named engineer · day 1
AVAILABLE ON ENTERPRISE
SOC 2 Type II
Audited annually. Report under NDA.
ISO 27001
Certified ISMS & audit packages.
HIPAA · GDPR
BAA + EU data residency on request.
Private endpoints
Dedicated IPs, VPC peering, no shared tenancy.
SSO · SAML · SCIM
Okta, Azure AD, Google — auto-provisioning.
Custom SLA
Named engineer, incident credits, audit logs.
09 / CASE STUDY
Reka × Infer
REKA EDGE · ON INFER
EDGE · PHYSICAL AI

Reka Edge.
The fastest on-device VLM, served on Infer.

Reka Edge is frontier-level edge intelligence for physical AI — a 7B vision-language model fast and lean enough to run on a drone, a Jetson, a car, or a wrist. Real-time spatial reasoning and object localization without cloud connectivity.

Reka built the model. Infer serves it. The same engine powering every other model in our catalog — one SDK shape, one queue, one bill — with the throughput-per-dollar that makes physical-AI deployment economically viable.

5.46 img/s · 2× FASTER THAN COMPARABLE 7Bs
522 ms · TIME TO FIRST TOKEN
3× · FEWER TOKENS PER 1024² IMAGE
10 / FAQ

Things teams ask
before signing.

The security-review and procurement questions, handled up front.

Still have questions? Book a 30-minute call →
Or email team@infer.sh — we usually reply within an hour.
Q.01 · How is Infer different from Replicate, Fal, or Runware?
Three things: the engine (we built Infer in-house to run multimodal models efficiently — same outputs, less compute), pricing (every model in the catalog is at least 20% cheaper than the source API; no minimums, no idle charges), and API shape (one SDK across the entire catalog, drop-in compatible with fal/Replicate so the migration is one URL change).
Q.02 · Do you train on my prompts or outputs?
No. Zero data retention by default. We don't log prompts, we don't store outputs past the response, and we don't use any inputs for training. Enterprise gets a BAA and private VPC on top.
Q.03 · Can I run my own fine-tuned model?
Yes — bring your weights and we'll host them on Infer with the same SDK shape as the rest of the catalog. LoRAs and full checkpoints both supported. Pay only for the calls the endpoint serves; no idle charges. Talk to us to scope it.
Q.04 · What happens at 10M+ requests / month?
You get pulled into our volume tier — additional discount on top of the standard catalog price, plus a named support engineer and reserved capacity if you need it. No contract gymnastics; talk to us when you're close.
Q.05 · Where does inference run, and can I pin a region?
Inference runs in our own GPU clusters today, with regional expansion in progress. EU-only data residency is available on Enterprise. Tell us your routing requirements and we'll scope what fits.
Q.06 · What's your uptime story?
We're a new platform — we don't make 12-month uptime claims we can't back yet. What we will commit to: a public status page once we have enough operational history to make it useful, an incident-credit policy in our standard terms, and direct Slack access for production teams running real volume.
11 / DAY-0 ACCESS

State-of-the-art models. Served on day zero.

When a new frontier model lands, it lands here. We track every leaderboard worth tracking — VBench, the Artificial Analysis arenas, MMMU-Pro, OpenVLM — and serve the top of each the day they ship.

View the leaderboards
IMAGE
01 · GPT Image 2
02 · GPT Image 1.5
03 · Nano Banana 2 · ON INFER

VIDEO
01 · HappyHorse-1.0
02 · Seedance 2.0 · ON INFER
03 · Kling 3.0 Pro · ON INFER

VLM
01 · GPT-5.4 Pro
02 · Claude Mythos
03 · Gemini 3.1 Pro
FINAL CALL

Ready to ship?

Start building for free. No credit card required.
Scale when you're ready.