# 17. Caching and Cost Optimisation

LLM calls are slow and expensive. Caching is the single highest-leverage optimisation for both.

## Caching Strategies

| Strategy | How it works | Best for |
| --- | --- | --- |
| Exact-match | Hash the prompt, store the response, return on a cache hit | Deterministic queries (extraction, classification) |
| Semantic | Embed the query, match against cached embeddings by similarity | Paraphrased questions |
| Prompt caching (provider-level) | Anthropic/OpenAI cache repeated prefixes server-side at reduced rates | Large system prompts, repeated context |
| KV-cache reuse | Reuse the attention key-value cache for shared prefixes (self-hosting) | High-throughput serving with vLLM |

For exact-match: use Redis, memcached, or a simple database table.
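A minimal exact-match sketch with Redis (`call_llm` is a placeholder for your actual model call, and the 24-hour TTL is an arbitrary choice):

```python
import hashlib
import json

import redis

r = redis.Redis()  # assumes a Redis instance on localhost

def cached_completion(model: str, prompt: str) -> str:
    # Key on everything that affects the output: model, prompt, and sampling params.
    key = "llmcache:" + hashlib.sha256(
        json.dumps({"model": model, "prompt": prompt}, sort_keys=True).encode()
    ).hexdigest()

    hit = r.get(key)
    if hit is not None:
        return hit.decode()

    response = call_llm(model, prompt)  # placeholder: your actual LLM call
    r.set(key, response, ex=86_400)  # 24-hour TTL; tune to how stale you can tolerate
    return response
```

Exact-match only pays off when generation is deterministic (temperature 0) or when returning a previously generated answer is acceptable.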

For semantic: libraries like GPTCache provide this out of the box.
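GPTCache handles the embedding, storage, and similarity matching for you. As a rough illustration of the underlying mechanism (not GPTCache's API), a hand-rolled version might look like this, with `embed` and `call_llm` as placeholders:

```python
import numpy as np

SIM_THRESHOLD = 0.92  # tune on real traffic: too low and you serve wrong answers

_cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached response)

def semantic_cached_completion(query: str) -> str:
    q = embed(query)  # placeholder: any embedding model with unit-normalised output
    for vec, response in _cache:
        if float(np.dot(q, vec)) >= SIM_THRESHOLD:  # cosine similarity for unit vectors
            return response
    response = call_llm(query)  # placeholder: your actual LLM call
    _cache.append((q, response))
    return response
```

In production you would back this with a vector index (FAISS, pgvector) instead of a linear scan, and audit near-threshold hits: a semantic cache that matches too eagerly returns confidently wrong answers.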

For prompt caching: OpenAI applies it automatically to repeated prompt prefixes, while Anthropic requires you to mark cacheable blocks with `cache_control`. Either way, structure your prompts so the static parts (system prompt, documents, few-shot examples) come first.
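A sketch of the Anthropic flavour (the model name and message content are placeholders; check the current docs for pricing and minimum cacheable prompt sizes):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # large static instructions, reference docs, few-shot examples

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder: use your actual model
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks a cache breakpoint: everything up to here is cached
            # server-side and re-read at a reduced rate on later requests.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarise this ticket: ..."}],
)
```

For KV-cache reuse when self-hosting (the last row of the table), vLLM's automatic prefix caching gives the same effect. A minimal sketch, assuming the model name:

```python
from vllm import LLM, SamplingParams

# Automatic prefix caching reuses the attention KV cache across requests
# that share a prompt prefix, so the static prefix is computed only once.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

STATIC_PREFIX = "You are a support assistant. ..."  # long system prompt shared by all requests

outputs = llm.generate(
    [f"{STATIC_PREFIX}\n\nUser: {q}\nAssistant:" for q in
     ("How do I reset my password?", "Where can I download my invoice?")],
    SamplingParams(max_tokens=128),
)
```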

## Other Cost Levers

!!! info "Track everything"
    Track cost per feature, per user, per query tier. If you can't attribute cost, you can't optimise it.
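A minimal sketch of per-feature cost attribution (the prices are illustrative placeholders; substitute your provider's current rates):

```python
from collections import defaultdict

# Illustrative USD prices per million input/output tokens; not current rates.
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

cost_by_feature: dict[str, float] = defaultdict(float)

def record_cost(feature: str, model: str, input_tokens: int, output_tokens: int) -> None:
    in_price, out_price = PRICES[model]
    cost_by_feature[feature] += (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: one call made by a hypothetical "summarise" feature.
record_cost("summarise", "gpt-4o-mini", input_tokens=3_200, output_tokens=450)
print(dict(cost_by_feature))  # {'summarise': 0.00075}
```

In production this becomes a metric tagged with feature, user, and query tier, emitted to your observability stack rather than held in an in-process dict.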