LLM calls are slow and expensive. Caching is the single highest-leverage optimisation for both.
| Strategy | How it works | Best for |
|---|---|---|
| Exact-match | Hash the prompt, store the response, return on cache hit | Deterministic queries (extraction, classification) |
| Semantic | Embed the query, match against cached embeddings by similarity | Paraphrased questions |
| Prompt caching (provider-level) | Anthropic/OpenAI cache repeated prefixes server-side at reduced rates | Large system prompts, repeated context |
| KV-cache reuse | Reuse attention key-value cache for shared prefixes (self-hosting) | High-throughput serving with vLLM |
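Taking the last row first: if you self-host, KV-cache reuse is mostly a configuration flag. A minimal sketch using vLLM's automatic prefix caching, where the model checkpoint, document path, and questions are all placeholders:

```python
from vllm import LLM, SamplingParams

# Automatic prefix caching: requests sharing a prompt prefix reuse its KV blocks.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    enable_prefix_caching=True,
)

shared_prefix = open("contract.txt").read()  # same document for many questions
questions = ["Who are the parties?", "What is the term?", "Any penalty clauses?"]
outputs = llm.generate(
    [shared_prefix + "\n\nQuestion: " + q for q in questions],
    SamplingParams(max_tokens=200),
)
```

After the first request, the prefix's attention cache is already resident, so the remaining requests only pay for their short suffixes.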
For exact-match: use Redis, memcached, or a simple database table.
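A minimal Redis-backed sketch; `generate` stands in for whatever provider call you make, and the key scheme is one reasonable choice among many:

```python
import hashlib
import json

import redis

r = redis.Redis()  # assumes a local Redis instance

def cached_completion(prompt: str, model: str, generate, ttl: int = 3600) -> str:
    """Return the cached response for this exact (model, prompt) pair, else generate and store."""
    # Key on everything that affects the output; add temperature etc. if you vary them.
    key = "llm:" + hashlib.sha256(
        json.dumps({"model": model, "prompt": prompt}, sort_keys=True).encode()
    ).hexdigest()

    if (hit := r.get(key)) is not None:
        return hit.decode()

    response = generate(prompt, model)  # your actual provider call
    r.set(key, response, ex=ttl)       # expire stale entries after `ttl` seconds
    return response
```

The TTL matters: for deterministic tasks like extraction you can set it long, but anything whose correct answer drifts over time needs a shorter expiry.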
For semantic: libraries like GPTCache provide this out of the box.
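GPTCache wires together an embedder, a vector store, and a similarity threshold for you, but the core mechanism fits in a few lines. A toy sketch, assuming you bring your own embedding function; the 0.92 threshold is an arbitrary starting point you'd tune on real traffic:

```python
import numpy as np

class SemanticCache:
    """Match new queries against cached ones by cosine similarity of embeddings."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.embeddings: list[np.ndarray] = []
        self.responses: list[str] = []

    def lookup(self, query_embedding: np.ndarray) -> str | None:
        if not self.embeddings:
            return None
        matrix = np.stack(self.embeddings)
        # Cosine similarity against every cached embedding.
        sims = matrix @ query_embedding / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_embedding)
        )
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def store(self, query_embedding: np.ndarray, response: str) -> None:
        self.embeddings.append(query_embedding)
        self.responses.append(response)
```

In production you'd swap the linear scan for a vector index (FAISS, pgvector) and spend real effort on the threshold: too low and users get answers to someone else's question, too high and the cache never hits.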
For prompt caching: no code change needed beyond structuring your prompts so the static parts come first.
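With Anthropic you additionally mark the static prefix with `cache_control`; OpenAI caches long, stable prefixes automatically. A sketch using the Anthropic SDK, where the system-prompt file and question are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

large_system_prompt = open("system_prompt.txt").read()  # the big, unchanging part
user_question = "Summarise the Q3 risks."               # the part that varies

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_system_prompt,
            # Everything up to and including this block is cached server-side.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_question}],
)
print(response.content[0].text)
```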
Set `max_tokens` appropriately. Don't pay for 4000 tokens when 200 will do.

!!! info "Track everything"

    Track cost per feature, per user, per query tier. If you can't attribute cost, you can't optimise it.
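Cost attribution can be as simple as one helper on every call path. A sketch with made-up per-million-token prices; substitute your provider's actual rates and your own metrics pipeline (statsd, OpenTelemetry, a database table) for the log line:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_cost")

# Hypothetical per-million-token prices in USD; substitute your provider's rates.
PRICES = {"example-model": {"input": 3.00, "output": 15.00}}

def record_cost(feature: str, user_id: str, model: str,
                input_tokens: int, output_tokens: int) -> float:
    """Attribute the cost of a single LLM call to a feature and a user."""
    p = PRICES[model]
    cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    log.info("feature=%s user=%s model=%s cost_usd=%.6f",
             feature, user_id, model, cost)
    return cost
```

Once every call is tagged this way, "which feature is burning the budget" becomes a query instead of a guess.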