What AI in production really costs: tokens, inference, maintenance
The per-token price is a fraction of the bill. Real cost is driven by context length, number of calls, retrieval, and maintenance. What matters is the cost per task completed.
- The per-token price is where the bill starts, not the whole cost.
- Four main cost drivers: context, number of calls, retrieval, maintenance.
- Look at the cost of one completed task, not the price of a single query.
The per-token price is not the whole cost
The most common mistake in estimating AI cost is to look at the price per token and multiply it by the number of queries. The result is usually far below the real bill, because it ignores how the system actually works in production.
Real cost is driven by four things: how much text the context holds, how many times you call the model per task, what retrieval adds, and what it costs to keep the whole thing running. The unit price is only the first of those factors.
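The gap between the naive estimate and a context-aware one can be sketched in a few lines. This is a minimal illustration, not a pricing model: the flat rate, the token counts, and the function names `naive_cost` and `context_aware_cost` are all hypothetical, and retrieval and maintenance are left out on purpose.

```python
# All numbers are hypothetical placeholders, not any provider's prices.
PRICE_PER_1K_TOKENS = 0.002  # assumed flat rate in USD per 1,000 tokens


def naive_cost(question_tokens: int, queries: int) -> float:
    """Unit price times number of queries, counting only the question."""
    return question_tokens / 1000 * PRICE_PER_1K_TOKENS * queries


def context_aware_cost(question_tokens: int, overhead_tokens: int,
                       calls_per_task: int, tasks: int) -> float:
    """Every call pays for the question plus the full context overhead."""
    tokens_per_call = question_tokens + overhead_tokens
    return tokens_per_call / 1000 * PRICE_PER_1K_TOKENS * calls_per_task * tasks


# Assumed workload: 10,000 tasks, 50-token questions,
# 3,950 tokens of instructions and history, 5 model calls per task.
naive = naive_cost(question_tokens=50, queries=10_000)
realistic = context_aware_cost(question_tokens=50, overhead_tokens=3_950,
                               calls_per_task=5, tasks=10_000)
print(f"naive: ${naive:.2f}  context-aware: ${realistic:.2f}")
```

With these assumed inputs the two estimates differ by orders of magnitude, even though the unit price is identical in both.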
Four cost drivers
Context length. You pay for every token in the context window, not just the user's question. System instructions, conversation history, and attached documents can be many times longer than the question itself. Every call carries that entire tail.
Number of calls. A single "user query" is rarely a single model call. An agent that plans, uses tools, and checks its output calls the model several times per task. That multiplier often moves the bill more than the unit price does.
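A rough sketch of how the call multiplier interacts with context: it assumes, as many agent loops do, that each step's output is appended to the history and sent again on the next call. The function name `agent_input_tokens` and all token counts are hypothetical.

```python
def agent_input_tokens(fixed_context: int, step_output: int, steps: int) -> list[int]:
    """Input tokens billed at each step, assuming every earlier step's
    output is appended to the history and resent with the next call."""
    return [fixed_context + i * step_output for i in range(steps)]


# Assumed task: 2,000 tokens of fixed context, 4 steps, 500 tokens added per step.
per_step = agent_input_tokens(fixed_context=2_000, step_output=500, steps=4)
total_input = sum(per_step)
print(per_step, total_input)
```

Under that assumption the task bills 11,000 input tokens, not the 2,000 a single-call estimate would suggest.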
Retrieval. RAG adds two costs: searching for chunks in the vector store, and the tokens of those chunks once they are injected into the context. The more chunks you inject to improve quality, the longer the context and the higher the inference bill.
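The two retrieval costs can be written down as one small function. This is a sketch under stated assumptions: a fixed cost per vector search, equally sized chunks, and a flat token price. The name `retrieval_cost_per_call` and every number are hypothetical.

```python
def retrieval_cost_per_call(top_k: int, chunk_tokens: int,
                            price_per_1k_tokens: float,
                            search_cost: float) -> float:
    """Search cost plus the inference cost of the injected chunk tokens."""
    injected_tokens = top_k * chunk_tokens
    token_cost = injected_tokens / 1000 * price_per_1k_tokens
    return search_cost + token_cost


# Assumed: 400-token chunks, $0.002 per 1,000 tokens, $0.0001 per search.
for top_k in (2, 5, 10):
    cost = retrieval_cost_per_call(top_k, 400, 0.002, 0.0001)
    print(f"top_k={top_k}: ${cost:.4f} per call")
```

The search cost stays flat while the token cost scales linearly with the number of chunks, which is why raising `top_k` for quality shows up on the inference bill.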
Maintenance. This is the line item that is easiest to overlook and that keeps growing longest: quality monitoring, keeping retrieval data current, reacting to model version changes, fixing regressions. Inference shows up on the invoice; ops work is often the larger cost and stays hidden.
A sample bill for one task
Take a task handled by an agent with retrieval. The shares below are qualitative and illustrative: they show the structure of the bill, not any one provider's price list.
| Line item | What it's made of | Share of the bill |
|---|---|---|
| Context (instructions + history) | A fixed tail in every call | High |
| Model calls per task | Planning, tools, verification | High |
| Retrieval (search + tokens) | Vector store + attached chunks | Medium |
| Inference for the question itself | What we intuitively count | Low |
| Maintenance (monthly) | Monitoring, data, regressions | Grows over time |
Operator's rule: don't price the query. Price one correctly completed task — with all the calls, context, and corrections that lead to it.
Why a cheaper model can cost more
It's tempting to pick the cheapest model and cut the bill. But a model that gets it wrong more often generates a hidden cost: retries, corrections, human work on errors, sometimes damage on the client's side. Cheaper inference per query can produce a higher cost per completed task, and cost per completed task is the right unit of accounting.
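A small expected-cost sketch makes the effect concrete. It assumes a simplified model: each attempt succeeds independently with a fixed probability, failed attempts are retried until one succeeds, and every failure carries a fixed handling cost (detection, correction, human time). The function name and all figures are hypothetical.

```python
def cost_per_completed_task(attempt_cost: float, success_rate: float,
                            failure_handling_cost: float) -> float:
    """Expected cost of one completed task when failures are retried.

    With independent attempts, a task needs 1 / success_rate attempts on
    average and incurs (1 - success_rate) / success_rate failures on average.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    expected_attempts = 1 / success_rate
    expected_failures = (1 - success_rate) / success_rate
    return attempt_cost * expected_attempts + failure_handling_cost * expected_failures


# Assumed: each failure costs $0.50 to detect and correct.
cheap = cost_per_completed_task(attempt_cost=0.01, success_rate=0.70,
                                failure_handling_cost=0.50)
strong = cost_per_completed_task(attempt_cost=0.05, success_rate=0.95,
                                 failure_handling_cost=0.50)
print(f"cheap model: ${cheap:.3f}  stronger model: ${strong:.3f}")
```

With these assumed inputs the model that is five times cheaper per attempt ends up roughly three times more expensive per completed task, because failure handling dominates the inference price.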
How to count honestly
A sound TCO has three layers. First, inference: price per token times the real context length times the number of calls per task. Second, retrieval: the cost of the vector store and the extra tokens. Third, maintenance: the fixed monthly ops overhead that doesn't disappear once you go live.
Add it all up and divide by the number of correctly completed tasks. Only that number lets you compare options — models, architectures, build vs. buy — on a single scale, instead of comparing price lists, which describe a fraction of the real cost.
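The three layers and the final division can be put together as follows. This is a minimal sketch, assuming a flat token price and a single per-task retrieval figure that lumps together vector store cost and search; the class `MonthlyTco`, its fields, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MonthlyTco:
    price_per_1k_tokens: float      # unit price from the price list
    tokens_per_call: int            # real context length, not just the question
    calls_per_task: int             # planning, tools, verification
    retrieval_cost_per_task: float  # vector store and search, per task
    maintenance_per_month: float    # fixed ops overhead

    def cost_per_completed_task(self, tasks_attempted: int,
                                success_rate: float) -> float:
        """Total monthly cost divided by correctly completed tasks."""
        inference = (self.tokens_per_call / 1000 * self.price_per_1k_tokens
                     * self.calls_per_task * tasks_attempted)
        retrieval = self.retrieval_cost_per_task * tasks_attempted
        total = inference + retrieval + self.maintenance_per_month
        completed = tasks_attempted * success_rate
        if completed <= 0:
            raise ValueError("no completed tasks to divide by")
        return total / completed


# Assumed month: 10,000 tasks attempted, 90% completed correctly.
option = MonthlyTco(price_per_1k_tokens=0.002, tokens_per_call=4_000,
                    calls_per_task=5, retrieval_cost_per_task=0.005,
                    maintenance_per_month=2_000.0)
print(f"${option.cost_per_completed_task(10_000, 0.90):.4f} per completed task")
```

In this assumed example inference comes to $400 and retrieval to $50, while maintenance adds $2,000, so the fixed overhead dominates the per-task figure. Running the same calculation for each candidate gives the single scale the comparison needs.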
Frequently asked questions
- Why is the AI bill higher than the price list suggests?
  - Because the price list quotes a per-token rate, and you pay for every token in the context (instructions, history, attached documents), multiplied by the number of calls. A single "user query" is often several model calls.
- Does a cheaper model always lower the cost?
  - Not necessarily. A cheaper model that gets it wrong more often generates corrections, retries, and human work. What counts is the cost per task done correctly, not the price of a single call.
- What costs the most over the long run?
  - Usually maintenance: quality monitoring, keeping retrieval data current, reacting to model changes, and fixing regressions. Inference is visible; ops is often larger and less visible.