billinganthropicengineering

Anthropic prompt cache: when the 25% write premium actually pays back

ApiLink Team·May 18, 2026·7 min read简体中文

Anthropic’s prompt cache is one of those features that gets described as “90% cheaper” in marketing and “it depends” by everyone who has actually paid the bill. This post is the spreadsheet version: when does the 25% write premium pay back, when does it lose you money, and what should you do when running agentic workflows that rewrite cache entries every few seconds.

The three rates you actually pay

Anthropic charges three different per-input-token rates depending on what happens with each token:

Token bucket	Rate vs base input	When it applies
Regular input	1.0x	Tokens that are not part of any cache block.
Cache write	1.25x	First time you send a block marked with cache_control. Anthropic stores it and charges 25% extra.
Cache read	0.1x	Subsequent calls within 5 minutes that hit the same cached prefix. Costs 10% of base input.

So “90% cheaper” is true — for the read rate. But every read presupposes a write, and every write costs more than the no-cache baseline. The question is how many reads you need before the write pays back.

The break-even formula

Let’s say a cacheable block is Tinput tokens. Per Anthropic’s base rate p per million tokens, the cost comparison over N calls is:

text

Without cache:    N * T * p
With cache:       1.25 * T * p   (one-time write)
                + (N - 1) * T * 0.1 * p   (subsequent reads)

Break-even:       N * p = 1.25*p + (N-1) * 0.1*p
                  N = 1.25 + 0.1*(N-1)
                  0.9N = 1.15
                  N ≈ 1.28

You need to read the cache at least twice within the 5-minute window for it to pay off. Once is a net loss (you paid 1.25x and only saved 0.9x). Two reads break even. Three or more reads are where the dramatic savings start showing up.

The 5-minute TTL is critical. Anthropic ages out cache entries 5 minutes after the last read. If your second read is at minute 6, the cache is gone and you pay 1.25x to rewrite it. We have seen agent workflows that idle for exactly long enough to lose the cache on every loop.

Where caching reliably wins

Three patterns produce 5+ reads per write reliably:

Long system prompts. A 4000-token system prompt shared across a chat session. Every user turn reads the cache. Even a moderate-length conversation (5 turns) saves ~70%.
Document Q&A. Upload a 20K-token document, then ask 10 questions about it within a few minutes. Cache write on question 1, cache read on questions 2–10. Saves ~85% vs no cache.
Agent inner loops. When an agent re-sends the same tool definitions + system prompt on every step, those are cacheable. Often hundreds of reads per write.

Where caching loses you money

Three patterns to be careful with:

One-shot calls.If you only call Claude once with a given prompt, marking it cacheable means you pay 1.25x for nothing. Don’t mark cache_control unless you genuinely plan to re-call.
Slow agents. Multi-step agents with long tool-execution gaps between LLM calls. If a single tool takes 6 minutes to run, your cache expired before the next LLM call, and you just paid the write premium for one read.
Cache-busting prompts. A prompt that includes the current timestamp, request UUID, or anything else that changes per-call invalidates the cache. We have seen production code that inlined new Date() into the system prompt and paid 1.25x on every single call for years.

Who eats the 1.25x write premium?

This is where AI gateways get to make a choice with real money on the line.

Option A: pass-through pricing.The gateway charges users 1.25x on writes, 0.1x on reads, identical to Anthropic’s billing. Honest, but introduces variance — a user who happens to miss the cache window pays more than predicted, and writes Reddit threads asking why their bill jumped.

Option B: absorb the write premium. The gateway charges users 1.0x on writes (eating the 0.25x as a cost of doing business) and only passes through the 0.1x read discount. More predictable for the user, but means gateways carry the variance. Sustainable only if average read counts comfortably exceed 1.

We picked Option B at ApiLink. Two reasons:

Most legitimate use of caching produces 5+ reads per write, so the math works in expectation.
Users predicting their own bill is a major friction point. “Why did this call cost 25% more than yesterday?” is a support ticket we’d rather not write.

Worth saying: this only works if your users don’t intentionally write-then-disconnect to abuse the absorbed premium. We rate-limit single-shot cache writes to keep that path closed.

Pre-flight checklist before you enable caching

Estimate average reads-per-write for the workload. If under 2, don’t cache.
Make sure no per-call mutable data (timestamps, UUIDs, user IDs in some implementations) sneaks into the cacheable prefix.
Check whether your gateway absorbs the write premium or passes it through. If pass-through, your bill volatility on cache-miss days will be ~25% higher than the headline rate.
Instrument cache hit rate per workflow. A workflow that drops below 50% hit rate is leaking money — investigate before next month’s invoice.
For agentic loops: log time-between-calls. Any gap over 4 minutes is a cache risk; over 5 minutes is a guaranteed miss.

Closing

Prompt caching is a real win for the workloads it’s designed for, and a quiet 25% surcharge for the workloads it isn’t. The deciding number is “reads per write per 5-minute window.” If you can’t measure it, don’t enable it. If you can, and it’s 3+, turn it on and watch your bill drop.

For what it’s worth, we run instrumentation on cache hit rates in our admin panel; if you’re curious how your specific workload looks under caching, sign up and check the analytics tab after a few thousand calls.

About ApiLink

ApiLink is an OpenAI-compatible gateway for GPT, Claude, Gemini, DeepSeek and more. One key, transparent streaming-safe billing, RMB invoicing for China-based teams.

Learn more →

Keep reading

ApiLink vs OpenRouter vs ZenMux: an honest gateway comparison

Three AI gateways, side by side. Where each one wins, where each one loses, and the honest answer about using more than one.

Pointing OpenAI Codex CLI at a third-party gateway

Two environment variables and Codex talks to Claude, Gemini, or DeepSeek instead of GPT. Plus the same trick for Cursor, Aider, Cline.

Using Claude/GPT/Gemini from China: a compliance checklist

Payment, invoicing, forex, data residency — every wall Chinese teams hit a quarter into using OpenAI or Anthropic, with a concrete checklist.