Your system works locally. Now make it accessible, reliable, and scalable.
Package your app with Docker. Include your application code, dependencies, and configuration. Don’t include model weights in the image — load them at startup or serve from a separate inference endpoint.
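One way to do the load-at-startup part, sketched with FastAPI's lifespan hook and Hugging Face `transformers`; the model ID is a placeholder for whatever your app actually serves:

```python
# app.py -- weights are pulled when the container starts, not baked into the image
from contextlib import asynccontextmanager

from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-model"  # hypothetical; set via env var or config in practice
state: dict = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once per container start: download (or reuse cached) weights,
    # load them into memory, then start accepting traffic.
    state["tokenizer"] = AutoTokenizer.from_pretrained(MODEL_ID)
    state["model"] = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    yield
    state.clear()

app = FastAPI(lifespan=lifespan)

@app.get("/healthz")
async def healthz():
    # Readiness probe: only report healthy once the weights are in memory.
    return {"ready": "model" in state}
```

If you serve from a separate inference endpoint instead, the lifespan hook shrinks to a connectivity check and the image stays small either way.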
Pick based on your scale:
| Option | When to use |
|---|---|
| Serverless (AWS Lambda, Google Cloud Functions, Vercel, Modal) | Low-to-medium traffic. Zero infra management. Cold starts can hurt time-to-first-token (TTFT). |
| Container platforms (Cloud Run, ECS, Fly.io, Railway) | More control, persistent processes. Better for streaming and WebSockets. |
| GPU instances (AWS/GCP/Lambda Labs/RunPod) | Self-hosted models. Use vLLM or TGI as the serving layer (client sketch below). |
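For the self-hosted row, a rough client-side sketch, assuming a vLLM server already running on the GPU instance and exposing its OpenAI-compatible API on port 8000; the host and model name are placeholders:

```python
# Talk to a self-hosted vLLM (or TGI) server through its OpenAI-compatible API.
# Assumes something like `vllm serve <model>` is already running on the GPU box.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpu-host:8000/v1",  # hypothetical host; 8000 is vLLM's default port
    api_key="not-needed-locally",             # only enforced if you configure an API key on the server
)

resp = client.chat.completions.create(
    model="your-org/your-model",              # must match the model the server was started with
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,                              # stream tokens to keep perceived latency low
)

for chunk in resp:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Because the API surface matches the hosted providers, you can swap between a managed endpoint and your own GPU box by changing `base_url`.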
LLM backends have different bottlenecks from traditional APIs: latency is dominated by inference time, not your code. Load-test with realistic payloads (real prompt lengths, streamed responses) and know your throughput limit before users find it.
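A minimal load-test sketch along these lines, assuming an OpenAI-style streaming endpoint at a placeholder URL; it fires concurrent requests and records time-to-first-token and total latency for each:

```python
# Concurrency probe: measures TTFT and end-to-end latency under load.
# URL, payload, and concurrency level are placeholders for your own service.
import asyncio
import time

import httpx

URL = "https://your-app.example.com/v1/chat/completions"  # hypothetical endpoint
PAYLOAD = {
    "model": "your-model",
    "messages": [{"role": "user", "content": "Summarise the plot of Dune in 200 words."}],
    "stream": True,
}
CONCURRENCY = 20  # raise this until latency or the error rate degrades

async def one_request(client: httpx.AsyncClient) -> tuple[float, float]:
    start = time.perf_counter()
    ttft = None
    async with client.stream("POST", URL, json=PAYLOAD, timeout=120) as resp:
        resp.raise_for_status()
        async for _chunk in resp.aiter_bytes():
            if ttft is None:
                ttft = time.perf_counter() - start  # first streamed chunk arrived
    total = time.perf_counter() - start
    return (ttft if ttft is not None else total), total

async def main():
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(one_request(client) for _ in range(CONCURRENCY)))
    ttfts = sorted(r[0] for r in results)
    totals = sorted(r[1] for r in results)
    print(f"p50 TTFT {ttfts[len(ttfts) // 2]:.2f}s, p95 total {totals[int(len(totals) * 0.95) - 1]:.2f}s")

asyncio.run(main())
```

Use prompts and expected completion lengths that match your real traffic; tiny test payloads will overstate how much concurrency you can handle.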
Combine the LLMOps signals from Module 16 with traditional infrastructure metrics: CPU/GPU utilisation, memory, request queue depth, and error rates. Set up alerts before launch, not after the first incident.
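One way to expose the infrastructure side, assuming a Python service instrumented with `prometheus_client`; the metric names and the simulated handler are placeholders, and the alert thresholds belong in your monitoring stack, not in code:

```python
# Expose queue depth, latency, and error counts for scraping; alert rules
# (e.g. Prometheus + Alertmanager) fire on these before users notice.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "LLM requests served", ["status"])
LATENCY = Histogram("llm_request_seconds", "End-to-end request latency",
                    buckets=(0.5, 1, 2, 5, 10, 30, 60))
QUEUE_DEPTH = Gauge("llm_queue_depth", "Requests waiting for a generation slot")

def handle_request():
    QUEUE_DEPTH.inc()
    start = time.perf_counter()
    try:
        time.sleep(random.uniform(0.1, 0.5))  # stand-in for the real inference call
        REQUESTS.labels(status="ok").inc()
    except Exception:
        REQUESTS.labels(status="error").inc()
        raise
    finally:
        QUEUE_DEPTH.dec()
        LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(9100)  # metrics exposed at :9100/metrics
    while True:
        handle_request()
```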
Docker docs · Modal · Fly.io · vLLM · TGI · LitServe