How not to burn through your Claude session limit: token management in practice

Your Claude session limit runs out mostly for one reason: with every message, the model re-reads the entire conversation so far and bills tokens for it. The longer the conversation, the more the cost grows — not bit by bit, but in a snowball. The good news is that this is a matter of habits, not arcane knowledge — a few simple changes make the same plan stretch across far more work. I'll show you where these costs come from and what to do about them.

A few terms to start

Before I go further, let me fix the vocabulary. A token is the smallest fragment of text the model reads and charges for: roughly a piece of a word, though that isn't a strict rule. A session is a single, continuous conversation with the model. The session limit is the usage ceiling; once you hit it, you have to wait for a reset before you can keep working. The model is the AI engine that produces the responses, for instance Claude in its various versions.

The context window is the amount of text the model "sees" at once: the system instructions, the whole conversation, every tool call and its result, every file it has read. Think of it as the model's current working memory. Claude Code is the tool in which Claude works on programming and operational tasks: reading files, carrying out steps, using tools. It offers a context window on the order of a million tokens, which is a great deal.

Except that before you type anything, part of that window is already taken. The project instructions file, the connected tools, the extra skills: all of it loads at the start. Typical overhead is around 8,000 tokens, but on a fresh session it can reach 62,000. It's worth checking: in Claude Code the /context command shows how much is taken before you even begin. Sometimes that's a signal to remove something or move it elsewhere.
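To put those overhead figures in proportion, here is a small sketch of the arithmetic. The two overhead values and the two window sizes come from this article; the function name is only illustrative.

```python
def overhead_share(overhead_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window occupied before the first prompt."""
    return overhead_tokens / window_tokens


# Overhead figures and window sizes quoted in this article.
for overhead in (8_000, 62_000):
    for window in (200_000, 1_000_000):
        share = overhead_share(overhead, window)
        print(f"{overhead:>6} of {window:>9} tokens = {share:.1%}")
```

On a 200,000-token window, 62,000 tokens of overhead means nearly a third of the space is gone before any work starts.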

Why the cost snowballs

The most important thing to grasp is this: with each new message the model reads the whole conversation from the beginning. The first question, its answer, the second question, its answer, and so on every time, up to the latest prompt. That's why the cost doesn't simply add up; it compounds. The first message might cost 500 tokens and the thirtieth well over ten thousand, because along with it the model reprocesses everything that came before.
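The snowball is easy to reproduce with a few lines of arithmetic. The sketch below assumes, purely for illustration, that every exchange adds the same 500 tokens to the history; real conversations vary, and the function name is made up.

```python
def turn_costs(message_tokens):
    """Input tokens processed on each turn when the full history is re-read."""
    costs = []
    history = 0
    for tokens in message_tokens:
        history += tokens        # the new message joins the history
        costs.append(history)    # the model reads everything so far
    return costs


# Assumption: 30 exchanges, each adding 500 tokens to the conversation.
per_turn = turn_costs([500] * 30)
print(per_turn[0])     # 500
print(per_turn[-1])    # 15000
print(sum(per_turn))   # 232500
```

Only 15,000 tokens of new text were written in this example, yet the model processed 232,500 in total. The per-message cost grows steadily, so the running total grows roughly with the square of the conversation length.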

There's a concrete observation on this: one programmer traced a conversation of more than a hundred messages and worked out that 98.5% of all the tokens went purely to re-reading the earlier history. You could argue the model genuinely needs that context, but a ratio like that is still an enormous waste. Almost all the habits described below follow from it.
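A figure of that size is roughly what simple arithmetic predicts. The sketch below is not the programmer's method; it assumes every message is the same size, in which case the re-read share depends only on the number of messages.

```python
def reread_share(n_messages: int) -> float:
    """Share of processed tokens that were already in the history.

    Assumes equal-sized messages: turn k processes k messages, of which
    only one is new, so the total is n * (n + 1) / 2 message-reads.
    """
    new_reads = n_messages
    total_reads = n_messages * (n_messages + 1) // 2
    return 1 - new_reads / total_reads


print(round(reread_share(100) * 100, 1))   # 98.0
print(round(reread_share(133) * 100, 1))   # 98.5
```

Under this simplified model, a conversation of 133 messages already spends about 98.5% of its tokens on history.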

Context rot — when a longer conversation degrades the answers

There's a second cost too, less obvious than tokens. As the session grows, answer quality drops — because the model's attention is spread across an ever-larger number of tokens. The model starts to get distracted, lose the thread, contradict itself, edit files without reading them first. Half in jest I call it "AI dementia"; in Anthropic's materials it goes by context rot.

The numbers bear this out. The accuracy of retrieving information from the window drops from 92% at 256,000 tokens to 78% at a million. In other words: even if you manage to fill the whole million-token window, the model will be markedly worse at locating what it needs within it. And a worse model means worse token efficiency — to get the same result, you have to spend more of them. Hence a simple conclusion: a bigger window doesn't mean a better result. It's a safety margin rather than a target to be filled.

[Image: a clean, bright stream of light on the left and the same stream scattered into a hazy, dimmed smudge on the right, contrasting a fresh session with an overloaded one.]

Five things you can do after every answer

Anthropic's material puts it clearly: after each answer from the model you have five options. You can continue — simply write another message (it's easy to slip into this reflex and drag the conversation on endlessly). You can use /rewind — go back to an earlier message and erase everything that came after it. You can do /clear — start completely fresh. You can /compact — summarize the session and replace the history with that summary. Or you can hand the task to a subagent, delegating it to a fresh context window and getting back just the result.

The most important habit Anthropic recommends is /rewind. When the model does something wrong, most of us simply write "that didn't work, try a different way." It often helps — but that failed attempt, the wrong code and the bad path stay in the context and get read on every subsequent message, cluttering the answers that follow. /rewind takes you back to the point before the error, and the context stays clean.
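The price of leaving a failed attempt in the context can be estimated the same way. All numbers below are hypothetical: a 10,000-token history, a 3,000-token failed attempt, and 20 further messages of 500 tokens each.

```python
def total_input_tokens(history_tokens: int, turn_tokens: int, turns: int) -> int:
    """Tokens processed over the next `turns` messages, re-reading history each time."""
    total = 0
    for _ in range(turns):
        history_tokens += turn_tokens   # each new message extends the history
        total += history_tokens         # and the whole history is read again
    return total


# Hypothetical session: the failed attempt either stays in context or is rewound away.
after_rewind = total_input_tokens(10_000, 500, 20)
kept_failure = total_input_tokens(10_000 + 3_000, 500, 20)
print(kept_failure - after_rewind)   # 60000
```

In this sketch the 3,000-token dead end is paid for twenty times over, and that is before counting any effect on answer quality.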

Clearing, compacting and summaries

The rule from the documentation is simple: when starting a new task, do /clear; when continuing the same one, do /compact. Compacting means folding the conversation so far into a summary. The trouble is that Claude Code only triggers it automatically when the window is around 95% full, which, in my view, is far too late. With automatic compacting only some 20–30% of the original detail survives, and the model produces that summary at the worst possible moment, at the peak of context rot. It's like packing a suitcase five minutes before you leave instead of calmly, the day before.
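The rule can be written down as a small decision helper. This is only a sketch: the 95% figure comes from the text above, while the 60% threshold for compacting by hand is my own placeholder, not a documented value, and the function is not part of Claude Code.

```python
AUTO_COMPACT_AT = 0.95     # approximate automatic trigger mentioned in the text
MANUAL_COMPACT_AT = 0.60   # hypothetical threshold; tune it to your own sessions


def suggest_action(used_tokens: int, window_tokens: int, starting_new_task: bool) -> str:
    """Suggest what to do next based on the task and how full the window is."""
    if starting_new_task:
        return "/clear"
    usage = used_tokens / window_tokens
    if usage >= AUTO_COMPACT_AT:
        return "too late: automatic compacting is about to run"
    if usage >= MANUAL_COMPACT_AT:
        return "/compact"
    return "continue"


print(suggest_action(120_000, 1_000_000, starting_new_task=False))   # continue
print(suggest_action(650_000, 1_000_000, starting_new_task=False))   # /compact
print(suggest_action(650_000, 1_000_000, starting_new_task=True))    # /clear
```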

I recommend my own variant, which usually does the most to ease the pressure on the session limit. Before you reach the edge, ask the model for a full summary: what has been done so far and what comes next. Take that summary, do /clear, paste it back in and carry on in a fresh session, with all the context you need, as if nothing had reset. There's one condition: since you're losing the conversation history, you have to record your decisions elsewhere, in plan files, decision logs, task lists. It's like closing all the tabs in your browser while keeping your bookmarks.
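One way to make the "record your decisions elsewhere" step routine is to keep the handoff note in a file. The sketch below is one possible shape for such a note; the section titles, the file name and the sample entries are all invented for illustration.

```python
from datetime import date
from pathlib import Path
import tempfile


def build_handoff(done, next_steps, decisions, today=None):
    """Assemble a plain-text handoff note to paste into a fresh session."""
    today = today or date.today()
    lines = [f"Session handoff ({today.isoformat()})", ""]
    for title, items in (
        ("Done so far", done),
        ("Next steps", next_steps),
        ("Decisions made", decisions),
    ):
        lines.append(f"{title}:")
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines)


def save_handoff(text, path):
    """Write the note to disk so it survives /clear."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


# Hypothetical project state, written to a temporary folder for the demo.
note = build_handoff(
    done=["Parsed the export file", "Wrote the import script"],
    next_steps=["Add retry logic"],
    decisions=["Keep SQLite instead of Postgres"],
    today=date(2025, 1, 15),
)
with tempfile.TemporaryDirectory() as tmp:
    saved = save_handoff(note, Path(tmp) / "notes" / "handoff.md")
    print(saved.read_text(encoding="utf-8"))
```

In real use you would save the note inside the project rather than a temporary folder, so the next session can read it on demand.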

Subagents, or handing work off to the side

A subagent is a separate helper that gets its own fresh context window, does its task and sends back to your main session just the result. Picture an intern: if they're to go through fifty articles, you don't sit beside them and read along — you just ask for the finished summary. Same here: all the tedious work happens outside your window, so it doesn't clutter the main conversation. You can write plainly: "run a subagent to verify this." What's more, a subagent can use a cheaper model, which lowers the cost further at comparable quality. The art is in sensing which tasks are fit to delegate.
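The isolation can be shown with a toy model. This is not how Claude Code implements subagents; it only mimics the bookkeeping, with a crude word count standing in for a tokenizer and a placeholder string standing in for a real summary.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def run_subagent(task: str, documents: list[str]) -> tuple[str, int]:
    """Do the heavy reading in a separate context and return only the result."""
    sub_context = [task, *documents]   # fresh window: nothing from the main session
    tokens_read = sum(count_tokens(item) for item in sub_context)
    result = f"Checked {len(documents)} documents for: {task}"   # placeholder summary
    return result, tokens_read


main_context = ["plan the literature review"]
articles = ["lorem " * 2_000 for _ in range(50)]   # fifty articles of 2,000 words each

result, subagent_tokens = run_subagent("verify the citations", articles)
main_context.append(result)   # only the result enters the main window

main_tokens = sum(count_tokens(item) for item in main_context)
print(subagent_tokens)   # 100003
print(main_tokens)       # 11
```

The subagent read about a hundred thousand tokens, while the main session grew by seven.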

Fewer tokens at the source

A few simpler tricks cut usage before the conversation even begins.

Convert files to markdown. Markdown is a lightweight way of recording just the text, with no needless wrapping. Converting from HTML can shave off around 90% of the tokens, from PDF roughly 65–70%, from docx files about 33% — the model doesn't have to wade through page layout and formatting, the content alone is enough for it. (The exception is scans that need text recognized from an image; there the rule doesn't hold.)
Mind the project instructions file. A file like CLAUDE.md loads with every session, so its bloat costs you every time. I'd keep it under about 200 lines (roughly 2,000 tokens) and put in only what the model genuinely needs; move the rest into skills and files read on demand.
Start with a plan. Making a plan before you build spends tokens up front, but saves them later — because the model wanders into a dead end less often and needs less correcting.
Watch the limit indicator. Just seeing how much of your session is left changes your decisions: whether to send this prompt now, or whether to launch a team of agents. One honest note from me: some of the problems with the model aren't the tool's fault but a result of the way you work. It's worth taking a bit of the responsibility yourself.
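To see why stripping markup helps, the sketch below pulls the visible text out of an HTML fragment using only Python's standard library and compares rough sizes. It is a plain-text extractor, not a full HTML-to-markdown converter, and the four-characters-per-token estimate is a rule of thumb I am assuming, not an exact tokenizer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about four characters per token."""
    return max(1, len(text) // 4)


sample = (
    '<div class="post"><style>p { margin: 0 }</style>'
    '<p class="lead">Token <b>management</b> in practice</p></div>'
)
text = html_to_text(sample)
print(text)   # Token management in practice
print(estimate_tokens(sample), "->", estimate_tokens(text))
```

For real documents a dedicated converter that preserves headings, links and lists will give better results; the point here is only that most of the raw HTML was wrapping rather than content.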

[Image: an assembly line made of light, with three separate segments (discovery, planning, execution) joined by a single luminous belt on a dark background.]

Why it's not worth filling the whole window

To close, a thought that ties it all together. When people hear "a million tokens," they think they have that much room to use freely — and they get wasteful: they stop using subagents, stop acting deliberately, dump everything into one giant session. But the rules of how the models behave haven't changed. A bigger window isn't a better result, just more room for context rot.

The data backs this. In an analysis of 18,000 thinking blocks from 7,000 sessions, reasoning depth fell by 67% as the session lengthened, and editing files without reading them rose from 6% to 34% — the longer the session, the lazier and sloppier the model gets. In an extreme case, one user, through bad habits, drove a token bill from $345 to $42,000 a month, while the quality of the work stayed flat — the same result, a drastically higher cost.

The conclusion isn't complicated. What's genuinely productive is the first ten-to-twenty percent of a session, when the model is freshest. If you're just starting out, stay for a while with the smaller, 200,000-token window and build yourself some discipline: clear the session, record your progress, hand work to subagents. A bigger window can be like cookies on the desk while you're dieting — the more room there is, the easier it is to fall into worse habits. And if you feel the session has gone off the rails, don't fight it — open a new one and start clean.