Why does caching cut the real cost of working with Claude so sharply?

Because reading remembered context costs a fraction of fresh input

Because the tokens the model returns stop being counted

Because the model automatically shortens your messages before sending

Because every session gets a fixed, larger token limit

People claim that on a subscription (e.g. Claude Code) the cache window was cut from an hour to 5 minutes. What's actually true?

On a subscription it's still an hour; the 5 minutes is for the API and sub-agents

True — it's now 5 minutes everywhere

On a subscription it's 5 minutes, and an hour on the API

The window depends on how long the conversation is, not on the plan

Someone advises "Opus for planning, then Sonnet to execute" to save your limit. What do you need to keep in mind?

Every model switch is a cache reset and a fresh reprocessing of the conversation

Models share one cache, so switching costs nothing

A model in plan mode writes nothing to the cache at all

After a switch the cache runs faster because the conversation is already known

You edit the CLAUDE.md file mid-session. What happens to the cache?

Nothing — the change only takes effect after a restart; the current cache stays

The cache resets immediately, because the project layer changes

Only the system layer resets; the project layer stays

The cache now runs faster, because the file is already remembered

You're switching topics and want a fresh start without burning your limit. What sets /clear apart from /compact?

/compact summarizes the conversation and resets the cache in the process; /clear just clears it

They work identically — only the name differs

/clear resets the cache, while /compact preserves it

How to save tokens in Claude — what's worth knowing about caching · Wiki

If you work with Claude, a huge share of the tokens you technically process costs you only a fraction of the price — because they're cached, that is, remembered. This happens automatically: you don't have to switch anything on or change anything. It's still worth understanding a few simple rules, because they decide how fast you approach your session limit. Below is what you really need to know — without the nuances that only matter once you're using the API intensively.

Let me start with two terms. A token is a fragment of text the model operates on — roughly a piece of a word. Usage is counted for the tokens you feed the model (input) and the ones it returns (output). The cache is the mechanism that lets Claude avoid processing the same text from scratch every time — instructions, files, the earlier conversation. Once a fragment is remembered, the next question simply reads it back.

What the cache costs and how long it lives

The most important number: cached tokens cost only 10% of the price of regular input. In practice that's an enormous difference. Let me show it in numbers: if 91 million tokens pass through the model in a day and nearly all of them are read from the cache, the cost comes out as if you'd processed about 9 million fresh tokens (91 million × 10% ≈ 9.1 million). Everything beyond that is simply cheap re-reading of remembered context.
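To make the arithmetic concrete, here is a minimal Python sketch. It assumes only the 10% figure from the text. The function name and the split between cached and fresh tokens are made up for illustration, and the sketch ignores output tokens and the one-time cost of writing to the cache.

```python
# Assumption taken from the text: a cache read costs 10% of regular input.
CACHE_READ_FACTOR = 0.10


def effective_input_tokens(cached_read: int, fresh: int) -> float:
    """Express a day's input as an equivalent number of full-price tokens."""
    return cached_read * CACHE_READ_FACTOR + fresh


# Hypothetical day: all 91 million tokens are cache reads.
all_cached = effective_input_tokens(91_000_000, 0)

# Hypothetical day: 89 million cache reads plus 2 million fresh tokens.
mostly_cached = effective_input_tokens(89_000_000, 2_000_000)

print(f"{all_cached:,.0f}")     # roughly 9.1 million
print(f"{mostly_cached:,.0f}")  # roughly 10.9 million
```

Even a small share of fresh tokens moves the total noticeably, which is why the "about 9 million" figure only holds when almost everything is read from the cache.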

The cache doesn't last forever, though. It has a lifetime — called TTL (time to live) in the Anthropic documentation:

A Claude subscription (e.g. Claude Code in the terminal or as an extension): 1 hour. If you don't send a message for an hour and then write the next one, everything from that session is dropped from the cache and processed afresh.
Working through the API and sub-agents: 5 minutes. That window can't be shortened, but it can be extended to an hour — for an extra charge. Sub-agents get 5 minutes regardless of plan.
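The two lifetimes above can be written down as a small lookup. This is only an illustration of the rule, not how Claude implements it. The dictionary keys and the function name are my own labels.

```python
from datetime import timedelta

# Lifetimes as stated in the text; the keys are illustrative labels only.
CACHE_TTL = {
    "subscription": timedelta(hours=1),
    "api": timedelta(minutes=5),
    "sub_agent": timedelta(minutes=5),
}


def cache_still_warm(context: str, idle: timedelta) -> bool:
    """True if the pause since the last message is within the cache lifetime."""
    return idle <= CACHE_TTL[context]


# A 20-minute coffee break: fine on a subscription, too long for the API default.
print(cache_still_warm("subscription", timedelta(minutes=20)))
print(cache_still_warm("api", timedelta(minutes=20)))
```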

It helps to know where an old confusion came from. When users complained that their subscription limit was running out faster than before, some suspected the cache window had been quietly cut from an hour to 5 minutes. That didn't happen: on a subscription it's still an hour. The misunderstanding came from the API rules and the Claude Code rules being described together, even though they're two different things.

One caveat I want to state honestly: for the web app (Claude.ai), the Anthropic documentation doesn't describe unambiguously how caching works there. It's reasonable to assume it's similar to the subscription, but that isn't confirmed one hundred percent.

What the cache is made of during a conversation

The context Claude remembers splits into three layers:

The system layer — the base instructions, tool definitions (e.g. reading and writing files, running commands, searching), the response style. This is cached globally.
The project layer — files like CLAUDE.md, a project's memory and rules. Cached separately for each project.
The conversation layer — your messages and the model's replies. This one grows with every turn and is appended as you go — and that's how it should be.

It helps to distinguish two operations. A cache write (also called "cache create") is the one-time cost of remembering something for the first time; it pays for itself on the very next turn. A cache read is reusing what's already remembered, and that's exactly what costs a tenth of the price of fresh input.

The mechanism works step by step. On the first turn nothing is remembered yet: the model loads the system instructions, the project context and your first message, processes it all from scratch and writes it to the cache. On the second turn — as long as you stay inside the one-hour window — that whole base is already in place, so only the new reply and the new message need processing. And so on: each turn adds only a fresh fragment, while the rest is read back cheaply.
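The turn-by-turn mechanism described above can be simulated in a few lines. This is a simplified model under my own assumptions: token counts are invented, each entry in `turn_tokens` stands for everything new in that turn (the previous reply plus your new message), and the cache never expires during the run.

```python
def simulate_turns(base_tokens: int, turn_tokens: list[int]) -> list[tuple[int, int, int]]:
    """Return (turn number, tokens processed fresh, tokens read from cache) per turn."""
    rows = []
    cached = 0  # tokens already remembered from earlier turns
    for turn, new_tokens in enumerate(turn_tokens, start=1):
        # The system and project layers are only processed fresh on turn 1.
        fresh = new_tokens + (base_tokens if turn == 1 else 0)
        rows.append((turn, fresh, cached))
        cached += fresh  # everything processed this turn is remembered for the next
    return rows


# Hypothetical sizes: 20,000 tokens of instructions and project context,
# then three turns with 500, 800 and 600 new tokens.
for turn, fresh, read in simulate_turns(20_000, [500, 800, 600]):
    print(f"turn {turn}: fresh={fresh} read_from_cache={read}")
```

From the second turn on, the fresh column stays small while the cheap read column keeps growing, which is the pattern you want to preserve.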

Minimalist diagram of the cache's three layers: the two top bands stable, the bottom band growing segment by segment with each turn.

The problem appears when that base suddenly needs to be remembered again. Imagine you're on the sixteenth message — the entire earlier conversation is already being read back cheaply from the cache. If at that moment you do something that resets the cache, everything is processed from the start again. That's a costly move.
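How costly that is can be estimated with a rough sketch. It assumes the 10% read price from the text and made-up token counts, and it leaves out output tokens and any extra charge for writing to the cache.

```python
CACHE_READ_FACTOR = 0.10  # assumption from the text: reads cost 10% of input


def turn_cost(context_tokens: int, new_tokens: int, cache_hit: bool) -> float:
    """Cost of one turn in full-price input-token equivalents."""
    rate = CACHE_READ_FACTOR if cache_hit else 1.0
    return context_tokens * rate + new_tokens


# Hypothetical sixteenth message: 60,000 tokens of earlier context, 500 new.
warm = turn_cost(60_000, 500, cache_hit=True)
cold = turn_cost(60_000, 500, cache_hit=False)
print(f"warm cache: {warm:,.0f}  after a reset: {cold:,.0f}")
```

With these illustrative numbers, the same message costs roughly nine times more right after a reset than it does on a warm cache.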

What resets the cache

Three things throw out the remembered context and force it to be processed from scratch:

A break longer than an hour (on a subscription) or longer than 5 minutes (with the API and sub-agents).
A change to the system instructions — when the thing the whole session rests on changes.
Switching the model mid-session. Each model has its own cache. After a switch, the next request reads the entire conversation so far with no cache hit at all — even if the content is identical.
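The three triggers can be expressed as a simple check between two consecutive requests. This is a sketch of the rules as listed here, not Claude's actual logic. The `Request` class, its fields and the model labels are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Request:
    model: str
    system_prompt: str
    idle: timedelta  # time since the previous message


def reset_reasons(prev: Request, cur: Request, ttl: timedelta) -> list[str]:
    """List which of the three triggers from the text would reset the cache."""
    reasons = []
    if cur.idle > ttl:
        reasons.append("break longer than the cache lifetime")
    if cur.system_prompt != prev.system_prompt:
        reasons.append("system instructions changed")
    if cur.model != prev.model:
        reasons.append("model switched")
    return reasons


# Hypothetical example: same instructions, short pause, but a different model.
before = Request("opus", "base instructions", timedelta(minutes=2))
after = Request("sonnet", "base instructions", timedelta(minutes=3))
print(reset_reasons(before, after, ttl=timedelta(hours=1)))
```

An empty list means the next request can read the conversation back from the cache.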

That last point has a practical catch. A setup like "Opus for planning" (the Opus model in plan mode, then Sonnet to execute) is sometimes recommended as a way to save your session limit. But you should know that every such switch is a model change — and therefore a cache reset and fresh processing. Over the long run that setup may still save the limit, but the switching itself isn't free.

And one thing that works the opposite way to what you'd expect: editing the CLAUDE.md file mid-session does not reset the cache. The change only takes effect after the session is restarted, so the current cache stays intact.

Three habits that are enough for most people

I'll boil the whole thing down to three rules that, in my view, cover the needs of the vast majority of users: keep the session alive, keep it focused, and start over when you change tasks.

Don't take too long a break. If a session has been sitting for more than an hour, rather than returning to it, hand the work off to a new one. Coming back after a long break means reprocessing the whole thing anyway.
Start clean when you switch topics. The /clear command clears the session; /compact summarizes it — and resets the cache in the process. An alternative is a "session handoff": a short summary of what's done, which files were created and where to pick up, which you copy into a new, clean session. The effect is as if you'd lost nothing.
In Claude chat, put large documents into a project. If you use Claude through the website and plan to paste in extensive material, it's better to set up a project than to drop it straight into the conversation. Files in a project are cached in a way better suited to holding many documents. (This is an area the documentation doesn't describe directly — treat it as a sensible tip, not a hard rule.)

Three minimalist glowing icons in a row: a clock, a refresh arrow with a clean sheet, and a folder of documents.

Three simple session habits that are enough for most people to stop burning through tokens.
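The session handoff from the second habit can be as simple as a short text note. Here is one possible way to assemble it. The function, the layout and the example contents are all my own illustration, so adapt them freely.

```python
def build_handoff(done: list[str], files: list[str], next_step: str) -> str:
    """Assemble a short note to paste into a new, clean session."""
    lines = ["Session handoff", "", "Done so far:"]
    lines += [f"- {item}" for item in done]
    lines += ["", "Files created:"]
    lines += [f"- {path}" for path in files]
    lines += ["", f"Pick up here: {next_step}"]
    return "\n".join(lines)


# Hypothetical project state, purely for illustration.
note = build_handoff(
    done=["Parser handles quoted fields", "Unit tests pass"],
    files=["src/parser.py", "tests/test_parser.py"],
    next_step="add support for escaped quotes",
)
print(note)
```

Pasting a note like this as the first message of a fresh session keeps the new context small while preserving what matters.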

What you really need to know

Token caching can get very complex — the full documentation goes into nuances that aren't needed in ordinary work with Claude. The core is simple: remembered context costs 10% of the price of fresh input, on a subscription it lives for an hour, and it's reset by a long break, a change to the system instructions, and switching the model mid-session.

Hence a takeaway worth adopting more broadly than just for tokens: keep up with changes in your tools, but each time ask how much of it you actually need to know to do good work. The three habits above — a live session, a focused session, a fresh start when you change tasks — are enough to stop burning through your limits. The rest you can read up on when you genuinely start to need it.