What AI in production really costs: tokens, inference, maintenance

The per-token price is a fraction of the bill. Real cost is driven by context length, number of calls, retrieval, and maintenance. What matters is the cost per task completed.

The per-token price is not the whole cost

The most common mistake in estimating AI cost is to look at the price per token and multiply it by the number of queries. The result is usually far below the real bill, because it ignores how the system actually works in production.

Real cost is driven by four things: how much text the context holds, how many times you call the model per task, what retrieval adds, and what it costs to keep the whole thing running. The unit price is only the first of those factors.

Four cost drivers

Context length. You pay for every token in the context window, not just the user's question. System instructions, conversation history, and attached documents can be many times longer than the question itself. Every call carries that entire tail.

Number of calls. A single "user query" is rarely a single model call. An agent that plans, uses tools, and checks its output calls the model several times per task. That multiplier often moves the bill more than the unit price does.

Retrieval. RAG adds two costs: searching for chunks in the vector store, and the tokens of those chunks once they are injected into the context. The more chunks you add for quality, the longer the context and the higher the inference bill.

Maintenance. This is the line item that is easiest to overlook and that keeps growing after launch: quality monitoring, keeping retrieval data current, reacting to model version changes, fixing regressions. Inference shows up on the invoice; ops is often larger and hidden.
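The first three drivers multiply rather than add, which is why the naive estimate falls so far short. A minimal Python sketch of that arithmetic follows. The token counts, the prices, the split into input and output prices, and the names `TaskProfile`, `naive_cost`, and `task_cost` are all hypothetical. It also assumes every call resends the same context, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Token profile of one task. Every value passed in here is a hypothetical input."""
    question_tokens: int       # the user's question itself
    fixed_context_tokens: int  # system instructions + history, resent on every call
    retrieved_tokens: int      # RAG chunks injected into every call
    output_tokens: int         # tokens generated per call
    calls_per_task: int        # planning, tool use, verification


def naive_cost(p: TaskProfile, in_price: int, out_price: int) -> int:
    """The intuitive estimate: one call that contains only the question."""
    return p.question_tokens * in_price + p.output_tokens * out_price


def task_cost(p: TaskProfile, in_price: int, out_price: int) -> int:
    """Inference cost of one task: full context per call, times calls per task."""
    input_per_call = p.question_tokens + p.fixed_context_tokens + p.retrieved_tokens
    per_call = input_per_call * in_price + p.output_tokens * out_price
    return per_call * p.calls_per_task


# Prices in micro-dollars per token (1 = $1 per million tokens). Made-up values.
IN_PRICE, OUT_PRICE = 1, 4
profile = TaskProfile(
    question_tokens=50,
    fixed_context_tokens=3000,
    retrieved_tokens=2000,
    output_tokens=300,
    calls_per_task=5,
)

naive = naive_cost(profile, IN_PRICE, OUT_PRICE)
real = task_cost(profile, IN_PRICE, OUT_PRICE)
print(f"naive: {naive} micro-dollars, per task: {real} micro-dollars, ratio: {real / naive:.0f}x")
```

With these made-up inputs the per-task cost is 25 times the naive estimate, even though the per-token price is identical in both calculations. Maintenance is left out here because it is a fixed monthly overhead, not a per-call cost.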

A model bill per task

Take a task handled by an agent with retrieval. The shares below are illustrative: they show the structure of the bill, not any one provider's price list.

| Line item | What it's made of | Share of the bill |
| --- | --- | --- |
| Context (instructions + history) | A fixed tail in every call | High |
| Model calls per task | Planning, tools, verification | High |
| Retrieval (search + tokens) | Vector store + attached chunks | Medium |
| Inference for the question itself | What we intuitively count | Low |
| Maintenance (monthly) | Monitoring, data, regressions | Grows over time |

Operator's rule: don't price the query. Price one correctly completed task, with all the calls, context, and corrections that lead to it.

Why a cheaper model can cost more

It's tempting to pick the cheapest model and cut the bill. But a model that gets it wrong more often generates a hidden cost: retries, corrections, human work on errors, sometimes damage on the client's side. Cheaper inference per query can produce a higher cost per completed task, and cost per completed task is the right unit of accounting.
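One way to make this concrete is to compute an expected cost per completed task. The sketch below assumes a deliberately simple model that is not taken from any provider: attempts are independent, each succeeds with a fixed probability, the task is retried until it succeeds, and every failed attempt triggers a fixed handling cost (for example, human review). All numbers and the function name are hypothetical.

```python
def expected_cost_per_completed_task(cost_per_attempt: float,
                                     success_rate: float,
                                     failure_cost: float) -> float:
    """Expected cost until the first success, under a retry-until-success assumption.

    With independent attempts, the expected number of attempts is 1 / success_rate,
    so the expected number of failed attempts is 1 / success_rate - 1.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    expected_attempts = 1 / success_rate
    expected_failures = expected_attempts - 1
    return cost_per_attempt * expected_attempts + failure_cost * expected_failures


# Hypothetical inputs: the cheap model is 5x cheaper per attempt but fails more often.
HANDLING_COST = 2.00  # assumed cost of dealing with one failed attempt
cheap = expected_cost_per_completed_task(0.02, success_rate=0.80, failure_cost=HANDLING_COST)
strong = expected_cost_per_completed_task(0.10, success_rate=0.98, failure_cost=HANDLING_COST)

print(f"cheap model:  {cheap:.3f} per completed task")
print(f"strong model: {strong:.3f} per completed task")
```

Under these assumptions the cheap model comes out at roughly 0.525 per completed task and the stronger one at roughly 0.143. The result depends entirely on the failure cost and success rates you plug in, so measure them on your own workload before drawing a conclusion.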

How to count honestly

A sound total cost of ownership (TCO) estimate has three layers. First, inference: price per token times the real context length times the number of calls per task. Second, retrieval: the cost of the vector store and the extra tokens. Third, maintenance: the fixed monthly ops overhead that doesn't disappear once you go live.

Add it all up and divide by the number of correctly completed tasks. Only that number lets you compare options — models, architectures, build vs. buy — on a single scale, instead of comparing price lists, which describe a fraction of the real cost.
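The three layers and the final division can be sketched in a few lines of Python. The function name and every figure below are hypothetical. Retrieval is treated as a cost amortized per task, and maintenance as a flat monthly amount, which are simplifying assumptions.

```python
def cost_per_completed_task(inference_per_task: float,
                            retrieval_per_task: float,
                            monthly_maintenance: float,
                            tasks_attempted: int,
                            success_rate: float) -> float:
    """Monthly total cost divided by the number of correctly completed tasks."""
    completed = tasks_attempted * success_rate
    if completed <= 0:
        raise ValueError("no completed tasks to divide by")
    variable = (inference_per_task + retrieval_per_task) * tasks_attempted
    return (variable + monthly_maintenance) / completed


# Two hypothetical options with the same retrieval and maintenance costs.
option_a = cost_per_completed_task(0.03, 0.01, 3000.0, tasks_attempted=10_000, success_rate=0.80)
option_b = cost_per_completed_task(0.10, 0.01, 3000.0, tasks_attempted=10_000, success_rate=0.95)

print(f"option A: {option_a:.3f} per completed task")
print(f"option B: {option_b:.3f} per completed task")
```

In this made-up scenario, option B's inference is more than three times as expensive per task, yet the two options land within about 2% of each other per completed task (roughly 0.425 versus 0.432). The fixed maintenance layer and the success rate dominate the comparison, which the price lists alone would not show.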

Frequently asked questions

Why is the AI bill higher than the price list suggests?
Because the price list quotes a per-token rate, and you pay for every token in the context (instructions, history, attached documents), multiplied by the number of calls. A single "user query" is often several model calls.

Does a cheaper model always lower the cost?
Not necessarily. A cheaper model that gets it wrong more often generates corrections, retries, and human work. What counts is the cost per task done correctly, not the price of a single call.

What costs the most over the long run?
Usually maintenance: quality monitoring, keeping retrieval data current, reacting to model changes, and fixing regressions. Inference is visible; ops is often larger and less visible.