INFER / ENGINE v2.4
MULTIMODAL INFERENCE · COMPUTE

The inference engine.

For multimodal AI.

A bleeding-edge engine for optimized multimodal inference. Purpose-built for image, video, audio, and vision — the workloads defining the next decade of AI.

In partnership with NVIDIA
MULTIMODAL · IMAGE · VIDEO · VISION · AUDIO
OPTIMIZED FOR SPEED · QUALITY · COST
01 / TRY IT NOW
INTERACTIVE · LIVE INFERENCE
Generate in the app →
MODELS ON INFER
frontier & open-weights
Google · OpenAI · Black Forest Labs · ByteDance · Kuaishou · MiniMax · Alibaba · Ideogram · ElevenLabs · Reka
+ 100 MORE →
02 / EXPLORE MODELS

100+ models. One interface.

View all models →
nano-banana-2
Google's flagship T2I — fastest path to photoreal.
~1.4s · IMAGE

flux-1.1-pro
BFL's production T2I baseline. Broad style range.
~2.5s · IMAGE

ideogram-3.0
Best-in-class typography for posters & ads.
~3.5s · IMAGE

seedream-4.0
ByteDance's multi-reference image model.
~2.8s · IMAGE

kling-3.0-pro
Cinematic stylized video, 1080p, native 4K.
5–10s · VIDEO

seedance-2.0-pro
Director-grade video with native synced audio.
5–10s · VIDEO

sam-3.1
Segment anything in image and video. Open weights.
~0.4s · VISION

elevenlabs-v3
TTS with 70+ languages and inline emotion tags.
~1.2s · AUDIO
03 / THE INFERENCE ENGINE

Build with Infer.

One engine, every modality. Image, video, audio, vision — optimized for speed, quality, and cost. Same outputs as the source API, at lower latency and lower spend.

Image generation

Frontier T2I from Nano Banana 2, GPT Image, FLUX, Imagen, Seedream — same weights, same outputs, sub-2-second latency on most.

Image models →
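Under the hood this is a single REST call whose shape is shown in the curl example in section 07. A minimal standard-library sketch of building that request; the model id and API key below are placeholders:

```python
import json
import urllib.request

API_URL = "https://api.infer.sh/v1/run"

def build_run_request(model, prompt, api_key, **params):
    """Assemble a POST to /v1/run; body fields follow the curl
    example on this page (model, prompt, plus per-model params)."""
    body = {"model": model, "prompt": prompt, **params}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_run_request(
    "nano-banana-2",                 # any catalog model id
    "A futuristic city at sunset",
    "sk-...",                        # your API key
    width=1024,
    height=1024,
)
# urllib.request.urlopen(req) performs the call; the JSON response
# carries the output URL (result.url in the SDK examples).
```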

Video generation

Cinematic 5–10s clips with native audio. Kling 3.0 Pro, Veo 3.1, Seedance 2.0, Hailuo, Wan 2.2 — all on one queue, one bill.

Video models →
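A 5–10 second clip outlives a single request/response cycle, so the practical pattern is submit, then poll or take a webhook. A sketch of a polling loop with capped exponential backoff; the `status` field and its values here are hypothetical, not a documented response shape:

```python
import time

def backoff_schedule(base=0.5, cap=8.0, factor=2.0):
    """Yield capped-exponential poll intervals: 0.5, 1, 2, 4, 8, 8, ..."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)

def poll_until_done(fetch_status, delays, max_tries=60):
    """Call fetch_status() until the (hypothetical) job settles.
    fetch_status() returns a dict with a 'status' key."""
    for _attempt, delay in zip(range(max_tries), delays):
        job = fetch_status()
        if job["status"] in ("succeeded", "failed"):
            return job
        time.sleep(delay)
    raise TimeoutError("job did not settle in time")
```

Webhooks invert this: instead of polling, the engine calls you back when the clip is ready, which is the better fit for queue-heavy video workloads.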

Voice & audio

ElevenLabs v3 TTS with 70+ languages and inline emotion tags. Streaming output for IVR, podcasts, and real-time agents.

Audio models →
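Streaming output means consuming the response body as it arrives rather than waiting for the full file. A generic standard-library sketch (nothing Infer-specific; the actual TTS endpoint and headers would come from the API reference):

```python
import urllib.request

def stream_to_file(url, path, headers=None, chunk_size=8192):
    """Read a response body incrementally: chunks are written
    as they arrive instead of buffering the whole file in memory."""
    req = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        while chunk := resp.read(chunk_size):
            out.write(chunk)
```

For real-time agents and IVR you would feed chunks straight to a playback buffer instead of a file; the incremental-read loop is the same.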

Vision & segmentation

SAM 3.1 for object segmentation and tracking. Frontier VLMs for document, video, and multimodal understanding at production speed.

Vision models →
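This page doesn't specify the wire format SAM 3.1 masks come back in, but once decoded, applying a binary mask is plain per-pixel math. A dependency-free sketch with illustrative names:

```python
def apply_mask(pixels, mask, color=(255, 0, 0), alpha=0.5):
    """Blend `color` into RGB pixels wherever mask is truthy.
    pixels: rows of (r, g, b) tuples; mask: same shape, booleans."""
    out = []
    for row, mrow in zip(pixels, mask):
        out.append([
            tuple(int((1 - alpha) * p + alpha * c)
                  for p, c in zip(px, color))
            if m else px
            for px, m in zip(row, mrow)
        ])
    return out
```

In production you would do the same blend vectorized over arrays; the logic per masked pixel is identical.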

Edit & restoration

Instruction-driven image editing with FLUX.1 Kontext — restore, restyle, swap. Identity preservation across multi-step edits.

Editing models →

Custom & open weights

Bring your own LoRA or fine-tune; serve open-weights models at frontier-grade throughput. Wan, FLUX dev, Qwen Image — same SDK shape.

Open-weights →
04 / NVIDIA PARTNERSHIP

Backed by NVIDIA.
Built on the latest silicon.

We work hand-in-hand with NVIDIA on inference. That gets us early access to their newest accelerators and CUDA stack — and gets our researchers and engineers in the room with theirs to optimize Infer for whatever ships next. The work flows both ways.

Jensen Huang on stage at NVIDIA's CES 2025 keynote, with the AI evolution arc — Perception → Generative → Agentic → Physical AI.
CES 2025 KEYNOTE · LAS VEGAS
01 / EARLY ACCESS
Hardware as it ships. We get our hands on NVIDIA's newest accelerators ahead of general availability — and on the CUDA, TensorRT, and compiler advances that come with them. Infer runs on what's next, not what's in stock.

02 / JOINT ENGINEERING
Engineers in the same room. Our researchers and engineers work directly with NVIDIA's inference and kernel teams to optimize Infer for each new architecture. Bottlenecks get profiled jointly; fixes land upstream and downstream simultaneously.

03 / DEEP STACK
Optimized end to end. From kernel-level fusions to scheduler-aware batching to streaming protocol — every layer is tuned against NVIDIA hardware specifics. Generic inference servers can't match it because they don't have the relationship to know what to tune for.
05 / DEDICATED COMPUTE

Custom models. Dedicated GPUs.

FOR ENTERPRISE WORKLOADS

Bring your weights or fine-tune ours. Reserve capacity on the same hardware Infer runs on — Hopper today, Blackwell as it ships. Optimized stack, dedicated nodes per tenant, uptime measured against production SLAs.

H100 · H200 · NVL
AVAILABLE NOW

Hopper

141 GB HBM3e · 4.8 TB/s · NVLink

Frontier-grade throughput today. The H200 cluster runs the bulk of our serverless catalog; reserve dedicated nodes for hot LoRAs, regulated workloads, or 24/7 production loads.

MEMORY: 141 GB
BANDWIDTH: 4.8 TB/s
FP8 TFLOPS: 3,958
B200 · GB200 · NVL72
SHIPPING IN WAVES

Blackwell

192 GB HBM3e · 8 TB/s · 5th-gen NVLink

Next-generation compute as it lands at the foundry. First capacity is reserved for enterprise customers and frontier-model partners; talk to us before the queue closes for the next wave.

MEMORY: 192 GB
BANDWIDTH: 8 TB/s
FP4 TFLOPS: 20,000
CUSTOM MODELS
Bring your weights. LoRAs and full checkpoints, hosted on the same engine as the public catalog with the same SDK shape. Pay only for the calls the endpoint serves; no idle charges.

RESERVED CAPACITY
Pinned to your workload. Dedicated nodes per tenant — predictable latency, predictable cost, no shared-tenancy surprises. Scale up and down without re-provisioning.

UPTIME
99.95% on dedicated tiers. Measured against production SLAs. Incident credits in the standard contract; named engineer on call for enterprise tiers.

OPTIMIZED STACK
Tuned for the silicon you're on. Kernel-level fusions, scheduler-aware batching, and quantization paths re-tuned for each new architecture. Same stack we run on the serverless catalog.
06 / CHEAPER BY DESIGN

Same model. Lower bill.

Every model in the catalog is priced at least 20% under its source API. Same weights, same outputs, same SLA — the savings come from how Infer runs the model, not from cutting features.

MODEL                    MODALITY            VS SOURCE API
Nano Banana 2            Image               −22%
FLUX 1.1 [pro]           Image               −20%
Ideogram 3.0             Image · typography  −25%
Seedance 2.0 Pro         Video + audio       −20%
Kling 3.0 Pro            Video               −24%
Wan 2.2 (open weights)   Video · open        −23%
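The discount floor is mechanical: catalog price = source price × (1 − discount), with every discount at or above 0.20. A toy check using the table's percentages; the model ids and the $0.04 source price are illustrative, not published prices:

```python
# Discounts from the pricing table above (model ids illustrative)
CATALOG_DISCOUNTS = {
    "nano-banana-2": 0.22,
    "flux-1.1-pro": 0.20,
    "ideogram-3.0": 0.25,
    "seedance-2.0-pro": 0.20,
    "kling-3.0-pro": 0.24,
    "wan-2.2": 0.23,
}

def infer_price(source_price, model):
    """Catalog price: the source API price minus the listed discount."""
    return round(source_price * (1 - CATALOG_DISCOUNTS[model]), 4)

# "At least 20% under its source API" holds for every row
assert all(d >= 0.20 for d in CATALOG_DISCOUNTS.values())

# e.g. a hypothetical $0.04/image source price:
print(infer_price(0.04, "nano-banana-2"))  # 0.0312
```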
07 / DEVELOPER EXPERIENCE

Three lines to production.
// no setup required

Type-safe SDKs. OpenAPI spec. Streaming. Webhooks. Everything you need to ship fast — and nothing you don't.

example.py · STREAMING
import infer

# Initialize with your API key
client = infer.Client()

# Generate with any of 100+ models
result = client.run("flux-schnell", {
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024
})

# That's it. No infra, no queues.
print(result.url)
import Infer from "@infer/sdk";

// Initialize with your API key
const client = new Infer();

// Generate with any of 100+ models
const result = await client.run("flux-schnell", {
    prompt: "A futuristic city at sunset",
    width:  1024,
    height: 1024,
});

// Type-safe. Streaming-ready.
console.log(result.url);
package main

import (
    "fmt"

    infer "github.com/infer/sdk-go"
)

func main() {
    client := infer.NewClient()

    // Run any of 100+ models
    result, err := client.Run("flux-schnell", infer.Params{
        Prompt: "A futuristic city at sunset",
        Width:  1024,
        Height: 1024,
    })
    if err != nil {
        panic(err)
    }
    fmt.Println(result.URL)
}
# Works with any HTTP client
curl https://api.infer.sh/v1/run \
  -H "Authorization: Bearer $INFER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model":  "flux-schnell",
    "prompt": "A futuristic city at sunset",
    "width":  1024,
    "height": 1024
  }'

# Response streams back over HTTP/2
# X-Infer-Latency: 1243ms
08 / ENTERPRISE

Built for enterprise.

Reserved capacity, private endpoints, compliance packages — available as an add-on for teams operating at serious scale. Not included in Pro; talk to us and we'll scope what you need.

TYPICAL ENTERPRISE ENGAGEMENT
  • 01 Scope & compliance review · ~1 wk
  • 02 Reserved capacity provisioning · ~3 days
  • 03 Integration & private endpoints · ~1 wk
  • 04 Go-live with named engineer · day 1
AVAILABLE ON ENTERPRISE
SOC 2 Type II
Audited annually. Report under NDA.
ISO 27001
Certified ISMS & audit packages.
HIPAA · GDPR
BAA + EU data residency on request.
Private endpoints
Dedicated IPs, VPC peering, no shared tenancy.
SSO · SAML · SCIM
Okta, Azure AD, Google — auto-provisioning.
Custom SLA
Named engineer, incident credits, audit logs.
09 / CASE STUDY
Reka × Infer
REKA EDGE · ON INFER
EDGE · PHYSICAL AI

Reka Edge.
The fastest on-device VLM, served on Infer.

Reka Edge is frontier-level edge intelligence for physical AI — a 7B vision-language model fast and lean enough to run on a drone, a Jetson, a car, or a wrist. Real-time spatial reasoning and object localization without cloud connectivity.

Reka built the model. Infer serves it. The same engine powering every other model in our catalog — one SDK shape, one queue, one bill — with the throughput-per-dollar that makes physical-AI deployment economically viable.

5.46 img/s · 2× FASTER THAN COMPARABLE 7Bs
522 ms · TIME TO FIRST TOKEN
3× · FEWER TOKENS PER 1024² IMAGE
10 / FAQ

Things teams ask
before signing.

The security-review and procurement questions, handled up front.

Still have questions? Book a 30-minute call →
Or email team@infer.sh — we usually reply within an hour.
Q.01 · How is Infer different from Replicate, Fal, or Runware?
Three things: the engine (we built Infer in-house to run multimodal models efficiently — same outputs, less compute), pricing (every model in the catalog is at least 20% cheaper than the source API; no minimums, no idle charges), and API shape (one SDK across the entire catalog, drop-in compatible with fal/Replicate so the migration is one URL change).
Q.02 · Do you train on my prompts or outputs?
No. Zero data retention by default. We don't log prompts, we don't store outputs past the response, and we don't use any inputs for training. Enterprise gets a BAA and private VPC on top.
Q.03 · Can I run my own fine-tuned model?
Yes — bring your weights and we'll host them on Infer with the same SDK shape as the rest of the catalog. LoRAs and full checkpoints both supported. Pay only for the calls the endpoint serves; no idle charges. Talk to us to scope it.
Q.04 · What happens at 10M+ requests / month?
You get pulled into our volume tier — additional discount on top of the standard catalog price, plus a named support engineer and reserved capacity if you need it. No contract gymnastics; talk to us when you're close.
Q.05 · Where does inference run, and can I pin a region?
Inference runs in our own GPU clusters today, with regional expansion in progress. EU-only data residency is available on Enterprise. Tell us your routing requirements and we'll scope what fits.
Q.06 · What's your uptime story?
We're a new platform — we don't make 12-month uptime claims we can't back yet. What we will commit to: a public status page once we have enough operational history to make it useful, an incident-credit policy in our standard terms, and direct Slack access for production teams running real volume.
11 / DAY-0 ACCESS

State-of-the-art models. Served on day zero.

When a new frontier model lands, it lands here. We track every leaderboard worth tracking — VBench, the Artificial Analysis arenas, MMMU-Pro, OpenVLM — and serve the top of each the day they ship.

View the leaderboards
IMAGE
01 · GPT Image 2
02 · GPT Image 1.5
03 · Nano Banana 2 · ON INFER

VIDEO
01 · HappyHorse-1.0
02 · Seedance 2.0 · ON INFER
03 · Kling 3.0 Pro · ON INFER

VLM
01 · GPT-5.4 Pro
02 · Claude Mythos
03 · Gemini 3.1 Pro
FINAL CALL

Ready to ship?

Start building for free. No credit card required.
Scale when you're ready.