ai workflows

Token Reuse

Token reuse is what happens to your AI bill and your response times the longer a conversation runs. Every time you send a new message in a multi-turn session, the model does not pick up where it left off. It starts over from the top. The entire conversation, from your first message to your most recent one, gets fed back in as input. Turn two costs more than turn one. Turn ten costs more than turn five. The bill compounds with every exchange.

This happens because transformer models are stateless. There is no memory of previous turns baked into the weights themselves. The conversation history is simply text that gets prepended to each new request, making each successive request longer than the last. The model has no idea it has already processed those words. It just sees a longer document every time.

Token reuse is not the same as your context window filling up. Those are related problems but distinct ones. The context threshold is the hard ceiling where your conversation gets truncated or the model errors out. Token reuse is the slower, quieter problem that shows up long before you hit any limit. Your context window might hold 200,000 tokens. You might never come close to filling it. You are still paying to reprocess every previous token on every new turn.

People confuse token reuse with prompt caching, which is a feature designed to reduce it, not the problem itself. Anthropic introduced prompt caching in the Claude API in August 2024. OpenAI followed with similar functionality in late 2024. Caching writes a snapshot of computed token states so the model skips recomputing them from scratch on repeat requests. It cuts the cost of token reuse. It does not eliminate it. Caching only helps when the prefix of your prompt is stable and repeated across requests. A dynamic conversation where each turn appends new context does not benefit from caching the way a static system prompt does.

Run a ten-turn brainstorming session on the Claude API with Claude Opus 4. Start with a 500-token system prompt. Add roughly 300 tokens per turn. By turn ten, you are sending about 3,500 tokens of history on every single request, on top of whatever new content you are adding. At Claude Opus pricing of $15 per million input tokens, that history alone costs roughly five cents per request. Not catastrophic for a one-off session. Scale that to a thousand users doing daily ten-turn sessions and you are looking at fifty thousand dollars a month in replayed conversation history alone.

Cursor, the AI-native code editor that hit widespread adoption in 2024, ran into this publicly. Long coding sessions with large codebases would slow down noticeably as the conversation extended. The team built context pruning into the product: strategies that trimmed older conversation turns from the window to control compounding cost and latency. The tradeoff was hard. Prune too aggressively and the model loses relevant context. Prune too little and every request bloats. There is no clean answer, only a dial to tune.

Understanding token reuse earns its keep the moment your AI workflow extends past two or three turns. Customer support bots, research assistants, multi-step content pipelines built on chat-style API calls all carry this compounding cost. Knowing that changes how you design. It pushes you toward single-turn batching where you front-load the instructions, modular prompts where each subtask starts fresh, and context management that keeps only what the model actually needs in the window at any given time. Ignore it and your cost models will be wrong by an order of magnitude once you hit real usage.

Token reuse is almost irrelevant in single-turn applications. A tool that takes one input and returns one output, a document classifier, a one-shot image description, a fill-in-the-blank generator, pays no compounding cost. Each request is independent. The problem is exclusive to systems where conversation continuity is a design feature, which is exactly when you need to plan for it from day one.

The cost of every conversation compounds like interest, and you either price that in from the start or you find out the hard way when the invoice arrives.

Read the full guide

Continue with the parent article

Related terms

Keep exploring

ai workflows

Token Reuse

Keep exploring

AI Token

Context Window

Context Threshold