Anthropic prompt cache: when the 25% write premium actually pays back
Anthropic’s prompt cache is one of those features that gets described as “90% cheaper” in marketing and “it depends” by everyone who has actually paid the bill. This post is the spreadsheet version: when does the 25% write premium pay back, when does it lose you money, and what should you do when running agentic workflows that rewrite cache entries every few seconds.
The three rates you actually pay
Anthropic charges three different per-input-token rates depending on what happens with each token:
| Token bucket | Rate vs base input | When it applies |
|---|---|---|
| Regular input | 1.0x | Tokens that are not part of any cache block. |
| Cache write | 1.25x | First time you send a block marked with cache_control. Anthropic stores it and charges 25% extra. |
| Cache read | 0.1x | Subsequent calls within 5 minutes that hit the same cached prefix. Costs 10% of base input. |
So “90% cheaper” is true — for the read rate. But every read presupposes a write, and every write costs more than the no-cache baseline. The question is how many reads you need before the write pays back.
The break-even formula
Let’s say a cacheable block is Tinput tokens. Per Anthropic’s base rate p per million tokens, the cost comparison over N calls is:
You need to read the cache at least twice within the 5-minute window for it to pay off. Once is a net loss (you paid 1.25x and only saved 0.9x). Two reads break even. Three or more reads are where the dramatic savings start showing up.
Where caching reliably wins
Three patterns produce 5+ reads per write reliably:
- Long system prompts. A 4000-token system prompt shared across a chat session. Every user turn reads the cache. Even a moderate-length conversation (5 turns) saves ~70%.
- Document Q&A. Upload a 20K-token document, then ask 10 questions about it within a few minutes. Cache write on question 1, cache read on questions 2–10. Saves ~85% vs no cache.
- Agent inner loops. When an agent re-sends the same tool definitions + system prompt on every step, those are cacheable. Often hundreds of reads per write.
Where caching loses you money
Three patterns to be careful with:
- One-shot calls.If you only call Claude once with a given prompt, marking it cacheable means you pay 1.25x for nothing. Don’t mark
cache_controlunless you genuinely plan to re-call. - Slow agents. Multi-step agents with long tool-execution gaps between LLM calls. If a single tool takes 6 minutes to run, your cache expired before the next LLM call, and you just paid the write premium for one read.
- Cache-busting prompts. A prompt that includes the current timestamp, request UUID, or anything else that changes per-call invalidates the cache. We have seen production code that inlined
new Date()into the system prompt and paid 1.25x on every single call for years.
Who eats the 1.25x write premium?
This is where AI gateways get to make a choice with real money on the line.
Option A: pass-through pricing.The gateway charges users 1.25x on writes, 0.1x on reads, identical to Anthropic’s billing. Honest, but introduces variance — a user who happens to miss the cache window pays more than predicted, and writes Reddit threads asking why their bill jumped.
Option B: absorb the write premium. The gateway charges users 1.0x on writes (eating the 0.25x as a cost of doing business) and only passes through the 0.1x read discount. More predictable for the user, but means gateways carry the variance. Sustainable only if average read counts comfortably exceed 1.
We picked Option B at ApiLink. Two reasons:
- Most legitimate use of caching produces 5+ reads per write, so the math works in expectation.
- Users predicting their own bill is a major friction point. “Why did this call cost 25% more than yesterday?” is a support ticket we’d rather not write.
Worth saying: this only works if your users don’t intentionally write-then-disconnect to abuse the absorbed premium. We rate-limit single-shot cache writes to keep that path closed.
Pre-flight checklist before you enable caching
- Estimate average reads-per-write for the workload. If under 2, don’t cache.
- Make sure no per-call mutable data (timestamps, UUIDs, user IDs in some implementations) sneaks into the cacheable prefix.
- Check whether your gateway absorbs the write premium or passes it through. If pass-through, your bill volatility on cache-miss days will be ~25% higher than the headline rate.
- Instrument cache hit rate per workflow. A workflow that drops below 50% hit rate is leaking money — investigate before next month’s invoice.
- For agentic loops: log time-between-calls. Any gap over 4 minutes is a cache risk; over 5 minutes is a guaranteed miss.
Closing
Prompt caching is a real win for the workloads it’s designed for, and a quiet 25% surcharge for the workloads it isn’t. The deciding number is “reads per write per 5-minute window.” If you can’t measure it, don’t enable it. If you can, and it’s 3+, turn it on and watch your bill drop.
For what it’s worth, we run instrumentation on cache hit rates in our admin panel; if you’re curious how your specific workload looks under caching, sign up and check the analytics tab after a few thousand calls.