Among all the challenges of implementing agentic artificial intelligence, the least-understood issue is cost. AI providers such as OpenAI, Google, and Anthropic publish price lists, but none of those listed prices tells users what the final bill to actually solve a problem will be. The result, according to a new study of costs from the University of Michigan and collaborating institutions, could be sticker shock: soaring, unpredictable agent costs.
The study, led by Longju Bai of Michigan and including researchers from Stanford University, All Hands AI, Google DeepMind, Microsoft, and MIT, is titled "How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks." It is the first systematic study of AI agent token consumption. The paper, posted on the arXiv pre-print server, counts among its authors the prominent Stanford economist Erik Brynjolfsson, who has written extensively on AI's impact on productivity.
The top-level finding is that agents consume orders of magnitude more tokens than simple, turn-by-turn, prompt-based chats. For example, an agent can use 3,500 times as many tokens as a round of prompts with ChatGPT. A token is the fundamental unit of information processed by an AI model: it can be a piece of a word, a whole word, or just a punctuation mark. While one might expect agents to cost more in tokens, the study reveals more alarming facts: two different models can have wildly different token costs for the same task, and the same model can have different costs each time it works on the same problem, using as many as twice the number of tokens on one occasion as on another.
The worst part is that none of this can be predicted. Agents, Bai and team found, cannot reliably estimate how many tokens they will ultimately consume for a given task. "Agentic tasks are uniquely expensive," they wrote. Nor do more tokens necessarily improve results: "Simply scaling token usage may not lead to higher execution performance," they noted, and models systematically underestimate the tokens they need.
This rising cost and uncertainty of success are not accounted for in today's price lists from providers. The work suggests there is no easy fix. The best users can do is set hard limits on agentic token spend, even though that may cause agents to halt before completing tasks. The big picture is that users collectively will have to push back on vendors and demand reliable cost estimation and guarantees of task performance.
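A hard limit of the kind described above can be enforced outside the agent itself. The following is a minimal sketch, not any vendor's API: `run_step` is a hypothetical callable that executes one agent step and reports its token usage.

```python
class TokenBudgetExceeded(Exception):
    """Raised when an agent run hits its hard token ceiling."""

def run_with_budget(run_step, max_tokens):
    """Drive an agent step by step, halting once the budget is spent.

    `run_step` is a hypothetical callable returning a tuple of
    (tokens_used_this_step, done_flag) for each agent step.
    """
    spent = 0
    while True:
        used, done = run_step()
        spent += used
        if done:
            return spent  # task finished within budget
        if spent >= max_tokens:
            raise TokenBudgetExceeded(
                f"halted after {spent} tokens (budget {max_tokens})"
            )
```

The trade-off the study points to is explicit here: hitting the ceiling stops the run mid-task rather than letting costs accumulate without bound.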
Counting token costs
To study costs, the team used the open-source agentic AI framework OpenHands, developed by scholars at the University of Illinois Urbana-Champaign and others. They built agents and tested them on the open-source coding benchmark SWE-Bench, which uses tasks taken from actual GitHub issues. They first established the models' relative strengths: OpenAI's ChatGPT 5 and 5.2 achieve strong accuracy at low cost, though not the highest accuracy. Anthropic's Claude Sonnet-4.5 achieved the highest accuracy but at higher token costs. Google's Gemini-3-Pro was in the middle. The Kimi-K2 model from Chinese lab Moonshot had the worst trade-off: the most tokens for the lowest accuracy.
The authors suggested the differences stem from unique architectural properties: "The gap is not driven by task difficulty or by some models attempting harder problems. Instead, the same task is simply more expensive for some models than others, reflecting a behavioral tendency of the model rather than a property of the problem." But even the same model can take twice as many tokens to solve the same problem from one run to the next. "The most expensive runs double the token and monetary cost of the least expensive runs," they observed, indicating that agent token consumption has large variances even on identical problems.
More tokens don't necessarily get better results. Accuracy often peaks at intermediate cost and saturates at higher costs. Agent behavior becomes increasingly unstable on more complex tasks. Many models seem to search endlessly for solutions even when the search is fruitless. "Models lack a reliable mechanism to recognize when a task is unsolvable and stop early," wrote Bai and team. "Instead, they continue exploring, retrying, and re-reading context, accumulating cost without progress."
Unable to predict costs
These factors make token usage prediction and agent pricing fundamentally challenging. The team asked each AI agent to predict its own token usage with a prompt like: "You are a TOKEN ESTIMATION agent. Estimate the token cost to fix the following issue description." They found that agents can produce rough approximations, but the predictions tend to be too low. "Models consistently underestimate the tokens they need," wrote Bai and team. The bias is especially pronounced for input tokens, whose predictions stay compressed even as real values grow into the millions.
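That underestimation bias is straightforward to quantify if you log each run's self-estimate alongside its actual bill. A minimal sketch with illustrative numbers (the paired data below are made up, not the paper's measurements):

```python
import statistics

def estimation_bias(pairs):
    """Summarize how often and how badly an agent underestimates itself.

    `pairs` is a list of (predicted_tokens, actual_tokens) tuples,
    logged one per run. Returns the fraction of runs that exceeded
    their own estimate and the median actual-to-predicted ratio.
    """
    under = sum(1 for pred, actual in pairs if pred < actual)
    ratios = [actual / pred for pred, actual in pairs]
    return under / len(pairs), statistics.median(ratios)

# Illustrative runs: two of three blow past their own estimate.
runs = [(100_000, 300_000), (200_000, 150_000), (50_000, 500_000)]
over_rate, median_ratio = estimation_bias(runs)
```

A median ratio well above 1.0 over many logged runs would reproduce, on your own workload, the consistent underestimation the study reports.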
Watch those inputs
A critical finding concerns input tokens. Input tokens, such as what the human user types or what the agent retrieves via tools like database searches, dominate the cost, while output tokens account for a far smaller share. "Strikingly, input tokens, not output tokens, dominate the overall cost in agentic coding." The reason is that agentic workflows accumulate information from many sources, and the same context is fed into the model repeatedly, producing a dramatically higher input-to-output ratio than single-prompt or multi-prompt sessions. Drilling down further, cache reads dominate both raw token volume and dollar cost: in every phase, cache-read input tokens are the largest category, reflecting the cumulative reuse of prior context.
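A back-of-the-envelope calculator makes that cost profile concrete. The per-million-token prices and the token counts below are placeholders chosen for illustration, not any vendor's actual rates:

```python
def run_cost(tokens, price_per_m):
    """Break an agent run's bill down by token category.

    `tokens` maps category -> token count for one run;
    `price_per_m` maps category -> dollars per million tokens.
    Both use placeholder figures, not real vendor pricing.
    """
    costs = {k: tokens[k] * price_per_m[k] / 1_000_000 for k in tokens}
    costs["total"] = sum(costs.values())
    return costs

# Hypothetical agentic run: repeatedly re-fed context makes input-side
# tokens, especially cache reads, dwarf output tokens in sheer volume.
tokens = {"input": 2_000_000, "cache_read": 8_000_000, "output": 150_000}
prices = {"input": 3.00, "cache_read": 0.30, "output": 15.00}  # $/M, placeholder
bill = run_cost(tokens, prices)
```

Even with cache reads discounted per token, the input side of the ledger carries most of the bill in this sketch, mirroring the study's finding.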
Implications and user steps
Overall, the study confirms anecdotal experience with coding agents like Replit and Lovable, where costs constantly accumulate without transparency. What can be done? The authors suggest that even if agents can't predict exact tokens, they can offer coarse-grained estimates that support early budget alerts before launching expensive runs, improving cost transparency without overpromising precise token-level accuracy.
Since input tokens are the biggest cost element, users should control what they can at input: prompt size, context window width, and the number of tools the agent calls (such as databases). But there's only so much a user can do. Industry-wide changes are needed. These are the problems of a young industry, and vendors will have to be pushed by users to change their practices. The current lack of transparency about what an agent might cost to complete a task is too vague for enterprises that need to plan software investments. For now, the burden falls on users to run agentic tasks experimentally, over and over, to get average costs for planning.
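Given the run-to-run variance the study reports, where the most expensive run can double the cheapest, a summary over repeated runs of the same task is currently the best planning signal available. A minimal sketch, assuming you log per-run token totals yourself (the 25% headroom factor is an arbitrary choice, not a recommendation from the paper):

```python
import statistics

def summarize_runs(token_counts):
    """Turn logged per-run token totals into budget-planning figures."""
    return {
        "mean": statistics.mean(token_counts),
        "stdev": statistics.stdev(token_counts) if len(token_counts) > 1 else 0.0,
        # Run-to-run spread: the study observed the max reaching 2x the min.
        "max_over_min": max(token_counts) / min(token_counts),
        # Budget for the worst observed run plus 25% headroom, not the mean.
        "suggested_budget": int(max(token_counts) * 1.25),
    }

# Three illustrative runs of the same task.
summary = summarize_runs([400_000, 550_000, 800_000])
```

Budgeting off the worst observed run, rather than the mean, reflects the paper's finding that identical problems can vary twofold in cost.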
And the lack of any guarantee of success, even after an agent has burned through its tokens, is the most glaring problem. Enterprises could waste vast sums simply consuming tokens. Users collectively must push back on vendors like OpenAI, Google, and Anthropic, demanding price transparency and some form of guarantee that a task will be completed. Otherwise, the entire agentic AI exercise may be dominated by cost overruns and failed implementations. Early adopters have probably already run into these problems and may be content to pay high costs for an early edge. But that is not a situation that can lead to stable, steady use of agentic AI.
Source: ZDNET News