18. Deployment and Infrastructure

Your system works locally. Now make it accessible, reliable, and scalable.

Containerisation

Package your app with Docker. Include your application code, dependencies, and configuration. Don’t include model weights in the image — load them at startup or serve from a separate inference endpoint.
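
One common pattern for keeping weights out of the image is to pull them from a model hub or object store when the container boots. A minimal sketch of that idea, assuming a FastAPI app and huggingface_hub; the model ID, weights directory, and health endpoint are illustrative placeholders, not prescribed by this module:

```python
# startup_loader.py -- fetch model weights at container start instead of baking them into the image.
# Assumes FastAPI and huggingface_hub are installed; model ID and paths below are placeholders.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from huggingface_hub import snapshot_download

MODEL_ID = "your-org/your-model"   # hypothetical model repo
WEIGHTS_DIR = "/models"            # mounted volume or ephemeral disk

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Download (or reuse cached) weights once, before the app starts serving traffic.
    local_path = snapshot_download(repo_id=MODEL_ID, local_dir=WEIGHTS_DIR)
    state["model_path"] = local_path   # hand this path to your inference code
    yield
    state.clear()

app = FastAPI(lifespan=lifespan)

@app.get("/healthz")
async def healthz():
    # Readiness probes should only pass once the weights are on disk.
    return {"ready": "model_path" in state}
```

The same idea works with object storage instead of a model hub; either way, the image stays identical across model versions.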

Deployment Options

Pick based on your scale:

| Option | When to use |
| --- | --- |
| Serverless (AWS Lambda, Google Cloud Functions, Vercel, Modal) | Low-to-medium traffic. Zero infra management. Cold starts can hurt TTFT (time to first token). |
| Container platforms (Cloud Run, ECS, Fly.io, Railway) | More control, persistent processes. Better for streaming and WebSockets. |
| GPU instances (AWS/GCP/Lambda Labs/RunPod) | Self-hosted models. Use vLLM or TGI as the serving layer (see the sketch below). |
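
For the self-hosted GPU route, vLLM and TGI both expose OpenAI-compatible HTTP endpoints, so application code can stay provider-agnostic. A minimal sketch using the openai Python client against a locally running vLLM server; the base URL, port, and model name are assumptions for illustration:

```python
# Query a self-hosted vLLM (or TGI) server through its OpenAI-compatible API.
# Assumes the server was started separately (e.g. vLLM's OpenAI-compatible server on port 8000).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server; adjust for your deployment
    api_key="not-needed-locally",         # self-hosted servers typically ignore the key
)

response = client.chat.completions.create(
    model="your-org/your-model",          # must match the model the server loaded
    messages=[{"role": "user", "content": "Summarise our deployment options."}],
    stream=True,                          # stream tokens to keep TTFT low for users
)

for chunk in response:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
```

Because the client only sees a base URL, swapping between a hosted provider and your own GPU instances is a configuration change rather than a code change.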

API Design for LLM Backends

Load Testing

LLM backends have different bottlenecks from traditional APIs: latency is dominated by inference time, not your code. Test with realistic payloads and realistic concurrency, and know your throughput limit before users find it.
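
A minimal sketch of that kind of test, assuming an OpenAI-style chat endpoint and the httpx library; the URL, payload, and concurrency numbers are placeholders to adapt, not recommendations:

```python
# Minimal async load test for an LLM HTTP endpoint.
# Assumes httpx is installed; URL, payload, and concurrency below are placeholders.
import asyncio
import time

import httpx

URL = "http://localhost:8000/v1/chat/completions"   # hypothetical endpoint
PAYLOAD = {
    "model": "your-org/your-model",
    "messages": [{"role": "user", "content": "Write a 200-word product summary."}],
    "max_tokens": 256,
}
CONCURRENCY = 16          # simulated simultaneous users
REQUESTS_PER_WORKER = 5

async def worker(client: httpx.AsyncClient, latencies: list[float]) -> None:
    for _ in range(REQUESTS_PER_WORKER):
        start = time.perf_counter()
        resp = await client.post(URL, json=PAYLOAD, timeout=120)
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)

async def main() -> None:
    latencies: list[float] = []
    async with httpx.AsyncClient() as client:
        started = time.perf_counter()
        await asyncio.gather(*(worker(client, latencies) for _ in range(CONCURRENCY)))
        elapsed = time.perf_counter() - started
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"{len(latencies)} requests in {elapsed:.1f}s "
          f"({len(latencies) / elapsed:.2f} req/s), p95 latency {p95:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

Ramp the concurrency up until p95 latency or error rates become unacceptable; that knee is your practical throughput limit.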

Infrastructure Monitoring

Combine LLMOps (Module 16) with traditional infrastructure metrics: CPU/GPU utilisation, memory, request queue depth, error rates. Set up alerts before launch, not after the first incident.
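
As one way to wire those metrics in, a sketch using the prometheus_client library; the metric names, port, and simulated work are illustrative assumptions:

```python
# Expose basic infrastructure metrics alongside your LLMOps metrics.
# Assumes prometheus_client is installed and a Prometheus server scrapes port 9000 (both assumptions).
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "Requests served", ["status"])
QUEUE_DEPTH = Gauge("llm_request_queue_depth", "Requests waiting for the model")
LATENCY = Histogram("llm_request_latency_seconds", "End-to-end request latency")

def handle_request() -> None:
    QUEUE_DEPTH.inc()
    start = time.perf_counter()
    try:
        time.sleep(random.uniform(0.1, 0.5))   # stand-in for inference work
        REQUESTS.labels(status="ok").inc()
    except Exception:
        REQUESTS.labels(status="error").inc()
        raise
    finally:
        QUEUE_DEPTH.dec()
        LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(9000)   # metrics served at http://localhost:9000/metrics
    while True:
        handle_request()
```

Alert on the same signals you would watch during an incident: sustained queue depth, error rate, and p95 latency.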

Resources

Docker docs · Modal · Fly.io · vLLM · TGI · LitServe