Your Agentic AI Bill Is a Prompt Engineering Problem in Disguise

Last Updated on May 27, 2026 by Editorial Team

Author(s): Darshandagaa

Originally published on Towards AI.

An unoptimised agent running at 100 messages a day at 166K input tokens costs around $2,490 a month on Claude Opus 4.6. [1] That number is not a warning label. It is a real billing scenario I watched unfold on a healthcare pipeline I helped build at my current firm.
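The headline figure can be reproduced with simple arithmetic. The sketch below assumes a 30-day month and an input price of $5 per million tokens; that price is back-derived from the numbers quoted above rather than read from a price list, and the function name is illustrative.

```python
def monthly_input_cost(messages_per_day: int,
                       input_tokens_per_message: int,
                       usd_per_million_tokens: float,
                       days: int = 30) -> float:
    """Input-side cost only; output tokens are billed separately."""
    total_tokens = messages_per_day * input_tokens_per_message * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


# 100 messages/day x 166K input tokens x 30 days = 498M input tokens
print(monthly_input_cost(100, 166_000, 5.0))  # 2490.0
```

Because the cost is linear in tokens per message, every token trimmed from the per-message input is multiplied by the full message volume.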

The pipeline had 14 tools registered. Every turn sent all 14 tool schemas to the model, even when the current query had nothing to do with 11 of them. The system prompt was 3,800 tokens of carefully written context that the model re-read on every single message. And the outputs were verbose: paragraph-form reasoning for decisions that needed a single word.

I had tuned the retrieval logic for weeks. The evals looked clean. The bill looked insane.

Token spend in production agents is almost never an embeddings or retrieval problem. It is a prompt architecture problem — and the expensive parts are hiding in places most people never look.

TL;DR: Your agent’s token bill is dominated by tool schemas, verbose outputs, and redundant context — not retrieval. Most caching guides miss these. This article covers six techniques that go deeper, with code for each.

The Part Everyone Skips: Where the Tokens Actually Go

Prompt caching, lazy-loading tools, and sub-agent delegation are well-covered. [1] They are correct and worth implementing. But they address the symptoms, not the source.

Think of a bloated agent context like a restaurant kitchen that restocks every prep station fully before every single order — including the pastry station during a lunch rush that has never once served dessert.

That is what sending all tool schemas on every turn looks like from the billing side.

There are three token sinks that rarely get named directly:

The tool schema layer sends the full JSON definition of every registered tool to the model on every request. A single well-described tool with parameter documentation can run 200–400 tokens. With 15 tools, that is 3,000–6,000 tokens before the user has said a single word, and you are billed for every one of them.

The output verbosity problem is the inverse. Most agents are prompted to explain their reasoning, which is correct for debugging and transparency, but expensive in production. A routing decision that is 3 tokens in the right format costs 150 tokens when the model writes a paragraph.

The static context problem is the most expensive of all. A system prompt that works correctly does not need to be re-read in full on every turn. Most agents never compress, summarise, or selectively load it.
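One way to see which of these sinks dominates is to estimate the size of each layer of a request before sending it. The sketch below is a rough illustration, not a tokenizer: it relies on the common rule of thumb of about four characters per token, and the function names are hypothetical.

```python
import json

CHARS_PER_TOKEN = 4  # rule of thumb; real tokenizers will differ


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN


def input_breakdown(system_prompt: str,
                    tools: list[dict],
                    messages: list[dict]) -> dict[str, int]:
    """Estimate input tokens per layer of a single request."""
    return {
        "system_prompt": estimate_tokens(system_prompt),
        "tool_schemas": sum(estimate_tokens(json.dumps(t)) for t in tools),
        "history": sum(estimate_tokens(json.dumps(m)) for m in messages),
    }


breakdown = input_breakdown(
    system_prompt="You are a scheduling assistant.",
    tools=[{"name": "book_appointment", "description": "Book a slot."}],
    messages=[{"role": "user", "content": "Is Tuesday free?"}],
)
for layer, tokens in breakdown.items():
    print(f"{layer}: ~{tokens} tokens")
```

Running this against a real request makes it obvious whether schemas, the system prompt, or history is the layer worth attacking first.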

Six Things to Actually Do

1. Tool schema surgery

Every parameter description you write goes into the input token count. Most tool schemas are written for human readability, not token efficiency. The fix is not to remove descriptions — it is to write them like API docs, not prose.

# Before: 187 tokens just for this tool definition
# Note: Anthropic's Messages API expects the schema under "input_schema"
# ("parameters" is the OpenAI-style key).
tools_verbose = [{
    "name": "search_patient_records",
    "description": "This tool allows you to search through the patient medical records database. You can use it to find information about a specific patient, including their medical history, current medications, allergies, and recent lab results. The search is performed using the patient's ID number which is a unique identifier assigned to each patient in the system.",
    "input_schema": {
        "type": "object",
        "properties": {
            "patient_id": {
                "type": "string",
                "description": "The unique patient identifier number that you want to search for. This should be provided as a string in the format P followed by digits, for example P12345."
            },
            "fields": {
                "type": "array",
                "items": {"type": "string"},
                "description": "An optional list of specific fields that you would like to retrieve from the patient record. If you don't provide this parameter, all fields will be returned by default."
            }
        },
        "required": ["patient_id"]
    }
}]

# After: 61 tokens, same functional information
tools_lean = [{
    "name": "search_patient_records",
    "description": "Search patient DB by ID. Returns medical history, meds, allergies, labs.",
    "input_schema": {
        "type": "object",
        "properties": {
            "patient_id": {"type": "string", "description": "Format: P{digits}, e.g. P12345"},
            "fields": {"type": "array", "items": {"type": "string"}, "description": "Optional subset of fields. Default: all."}
        },
        "required": ["patient_id"]
    }
}]

That is a 67% reduction on one tool. Across 15 tools, that delta compounds on every single turn.

2. Dynamic tool loading by intent

Do not send tools the model does not need. Gate tool injection on a cheap intent classifier that runs before the main LLM call.

import re

from anthropic import Anthropic

client = Anthropic()

# Assumed to be defined elsewhere in the application:
#   ALL_TOOL_DEFINITIONS: list of full tool definition dicts
#   SYSTEM_PROMPT: the agent's system prompt string

# Full tool registry: never sent all at once
TOOL_REGISTRY = {
    "patient": ["search_patient_records", "update_medication", "get_lab_results"],
    "scheduling": ["book_appointment", "cancel_appointment", "check_availability"],
    "billing": ["get_invoice", "process_refund", "check_insurance"],
    "general": ["answer_question"]  # fallback
}


def classify_intent(user_message: str) -> str:
    """Fast, cheap classification using a keyword heuristic (a small model also works)."""
    keywords = {
        "patient": ["patient", "medication", "lab", "allergy", "history", "record"],
        "scheduling": ["appointment", "book", "schedule", "cancel", "available"],
        "billing": ["invoice", "refund", "insurance", "payment", "bill"]
    }
    msg_lower = user_message.lower()
    for intent, kwords in keywords.items():
        # Match at the start of a word: a plain substring test would let
        # "lab" fire inside "available" and misroute scheduling queries.
        if any(re.search(rf"\b{re.escape(k)}", msg_lower) for k in kwords):
            return intent
    return "general"


def get_tools_for_intent(intent: str) -> list:
    """Return only the tool definitions relevant to this intent."""
    tool_names = TOOL_REGISTRY.get(intent, TOOL_REGISTRY["general"])
    return [t for t in ALL_TOOL_DEFINITIONS if t["name"] in tool_names]


def run_agent_turn(user_message: str, conversation_history: list):
    """Run one turn and return the raw API response (a Message object)."""
    intent = classify_intent(user_message)
    active_tools = get_tools_for_intent(intent)

    # Only the tools for one intent are sent, not the full registry
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        tools=active_tools,
        messages=conversation_history + [{"role": "user", "content": user_message}]
    )
    return response

On a mixed-intent conversation, this reduces tool schema tokens by 70–80% per turn. The intent classifier itself is nearly free: a keyword heuristic spends no LLM tokens at all, and a tiny embedding-based router costs fewer than 10 tokens per message.

3. Output length contracts via max_tokens + structured outputs

The model will write as much as it is allowed to. If you do not set an explicit ceiling and a format constraint, it will fill the space.

The technique is to combine max_tokens with a structured output schema. When the model is constrained to emit a specific JSON shape, it stops generating the moment the shape is complete.

import json
from typing import Literal

import anthropic
from pydantic import BaseModel


class AgentDecision(BaseModel):
    action: Literal["retrieve", "answer", "escalate", "clarify"]
    tool_name: str | None = None
    tool_args: dict | None = None
    confidence: Literal["high", "medium", "low"]
    # No "reasoning" field: that 200-token explanation lives in the trace, not the output


DECISION_SCHEMA = AgentDecision.model_json_schema()


def get_routing_decision(user_message: str, context: str) -> AgentDecision:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-haiku-4-5",  # Use Haiku for routing decisions: a cheaper model tier
        max_tokens=150,  # Hard ceiling; the schema completes in ~60 tokens
        system=f"""You are a routing agent. Respond ONLY with valid JSON matching this schema:
{json.dumps(DECISION_SCHEMA, indent=2)}
Do not add explanation. Do not add keys not in the schema.""",
        messages=[{
            "role": "user",
            "content": f"Context: {context}\n\nUser: {user_message}"
        }]
    )

    raw = response.content[0].text
    # Raises pydantic.ValidationError if the model returns malformed JSON
    return AgentDecision.model_validate_json(raw)

Two optimisations compound here. Using Claude Haiku instead of Sonnet for routing decisions cuts the per-token price severalfold; the exact ratio depends on the model versions, so check current pricing. [2] Setting max_tokens=150 prevents the model from elaborating. Together they drop routing call cost from ~$0.004 to ~$0.0002, roughly a 20x reduction.
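A prompt-only JSON contract can still come back wrapped in a code fence, or cut off when the max_tokens ceiling is hit. The following is a hedged sketch of defensive parsing using only the standard library; the helper name and the choice of "clarify" as the fallback action are assumptions, not part of the original pipeline.

```python
import json

ALLOWED_ACTIONS = {"retrieve", "answer", "escalate", "clarify"}
FALLBACK = {"action": "clarify", "tool_name": None,
            "tool_args": None, "confidence": "low"}


def parse_decision(raw: str) -> dict:
    """Extract the routing decision, falling back to 'clarify' on bad output."""
    # Take the outermost {...} span so code fences or stray prose are ignored
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return dict(FALLBACK)  # no complete JSON object (e.g. truncated)
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(data, dict) or data.get("action") not in ALLOWED_ACTIONS:
        return dict(FALLBACK)
    return data


fenced = '```json\n{"action": "answer", "confidence": "high"}\n```'
print(parse_decision(fenced)["action"])              # answer
print(parse_decision('{"action": "retr')["action"])  # clarify
```

Falling back to a cheap, safe action keeps one malformed routing response from turning into a retry loop that costs more than the tokens saved.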

4. Context summarisation at conversation boundaries

Long conversations accumulate context that the model pays to re-read on every turn. The standard solution is a rolling summary that replaces old turns once the conversation exceeds a token threshold.

import anthropic


def maybe_compress_history(
    messages: list[dict],
    token_threshold: int = 8000,
    keep_recent_turns: int = 4
) -> list[dict]:
    """
    If conversation exceeds threshold, summarise all but the most recent turns.
    Preserves the last N messages verbatim for continuity.
    """
    # Rough token estimate: 4 chars ≈ 1 token
    total_chars = sum(len(str(m)) for m in messages)
    estimated_tokens = total_chars // 4
    if estimated_tokens < token_threshold:
        return messages

    # Split: compress old history, keep recent turns intact
    # (keep_recent_turns counts individual messages, not user/assistant pairs)
    to_compress = messages[:-keep_recent_turns]
    to_keep = messages[-keep_recent_turns:]
    if not to_compress:
        return messages

    # Only plain-text messages are summarised; messages whose content is a
    # list of blocks (tool calls and tool results) are skipped here.
    history_text = "\n".join([
        f"{m['role'].upper()}: {m['content']}"
        for m in to_compress
        if isinstance(m.get('content'), str)
    ])

    client = anthropic.Anthropic()
    summary_response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Summarise this conversation history in under 250 words. "
                       f"Preserve: decisions made, facts established, tools called, user intent. "
                       f"Discard: pleasantries, repeated context, verbose reasoning.\n\n{history_text}"
        }]
    )
    summary = summary_response.content[0].text

    compressed_history = [
        {"role": "user", "content": f"[CONVERSATION SUMMARY]\n{summary}"},
        {"role": "assistant", "content": "Understood. Continuing from the summary above."}
    ] + to_keep
    return compressed_history

On a 30-turn conversation, this typically cuts the input token count for turn 31 onwards by 60–75%.
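One caveat with slicing history at a fixed offset: if the kept window begins with a tool result whose matching tool call was summarised away, the request is likely to be rejected. A hedged sketch of one way to pick a safer cut point follows. The helper name is hypothetical, and it assumes plain user messages carry string content while tool calls and tool results carry a list of content blocks.

```python
def safe_split_index(messages: list[dict], keep_recent: int) -> int:
    """Index where the verbatim window should start.

    Walks backwards from the naive cut point until it lands on a plain
    user message, so a tool call and its result stay on the same side.
    Returns 0 when no such message exists (meaning: compress nothing).
    """
    if keep_recent <= 0:
        return len(messages)  # keep nothing verbatim
    idx = max(len(messages) - keep_recent, 0)
    while idx > 0:
        msg = messages[idx]
        if msg["role"] == "user" and isinstance(msg["content"], str):
            break
        idx -= 1
    return idx


history = [
    {"role": "user", "content": "Find patient P12345"},
    {"role": "assistant", "content": "Looking that up."},
    {"role": "user", "content": "Also check their labs"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "t1", "name": "get_lab_results", "input": {}}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t1", "content": "ok"}]},
    {"role": "assistant", "content": "Labs are normal."},
]
# The naive cut (index 4) would start the window on an orphaned tool result
print(safe_split_index(history, keep_recent=2))  # 2
```

The kept window may end up slightly longer than requested, which trades a few extra tokens for a history the API will accept.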

5. Thinking budget control (extended reasoning models)

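Claude’s extended thinking mode, available on models such as Sonnet 4.5 and Opus 4.5, takes a budget_tokens parameter that caps how many tokens the model spends on internal reasoning before answering. [3] Thinking is opt-in, and the API requires a budget of at least 1,024 tokens whenever it is enabled. Most agents that turn it on set one large budget for every call, which means the model reasons at length for questions that do not require it.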

from typing import Literal

import anthropic


def call_with_thinking_budget(
    prompt: str,
    task_complexity: Literal["simple", "moderate", "complex"]
) -> str:
    """
    Match thinking budget to actual task complexity.
    Simple factual lookups do not need 10,000 tokens of internal reasoning.
    """
    # The API rejects budget_tokens below 1,024, so "simple" uses the minimum.
    # For trivial lookups, omitting the thinking parameter is cheaper still.
    budgets = {
        "simple": 1024,    # Basic factual retrieval, classification
        "moderate": 2000,  # Multi-step reasoning, plan generation
        "complex": 8000    # Novel problem solving, long-horizon planning
    }
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=budgets[task_complexity] + 1024,  # budget + output headroom
        thinking={
            "type": "enabled",
            "budget_tokens": budgets[task_complexity]
        },
        messages=[{"role": "user", "content": prompt}]
    )
    # Return only the text response, not the thinking block
    for block in response.content:
        if block.type == "text":
            return block.text
    return ""

Routing simple queries with the minimum thinking budget (or with thinking switched off) instead of one large fixed budget can cut per-call cost for those queries by 80–90%, usually with no loss in answer quality.
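A variant worth considering is to switch thinking off entirely for the simplest tier rather than paying for the minimum budget. The sketch below only builds the keyword arguments and makes no API call; the helper name, the tier budgets, and the headroom constant are illustrative assumptions.

```python
MIN_THINKING_BUDGET = 1024  # API minimum when thinking is enabled
OUTPUT_HEADROOM = 1024      # tokens reserved for the visible answer

# 0 means "do not enable thinking for this tier"
BUDGETS = {"simple": 0, "moderate": 2000, "complex": 8000}


def thinking_kwargs(task_complexity: str) -> dict:
    """Build the max_tokens/thinking arguments for messages.create()."""
    budget = BUDGETS[task_complexity]
    if budget == 0:
        return {"max_tokens": OUTPUT_HEADROOM}  # no thinking parameter at all
    budget = max(budget, MIN_THINKING_BUDGET)
    return {
        "max_tokens": budget + OUTPUT_HEADROOM,  # must exceed budget_tokens
        "thinking": {"type": "enabled", "budget_tokens": budget},
    }


print(thinking_kwargs("simple"))  # {'max_tokens': 1024}
print(thinking_kwargs("complex")["thinking"]["budget_tokens"])  # 8000
```

The result can be splatted into the call, for example `client.messages.create(model=..., messages=..., **thinking_kwargs("simple"))`, so the tiering logic stays in one place.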

6. Tool result truncation before re-injection

When a tool returns a result, that result gets appended to the conversation and re-read on every subsequent turn. A web search that returns 4,000 tokens of content will be in the context for the rest of the session.

Truncate tool results before they enter the conversation history.

def truncate_tool_result(
    tool_name: str,
    raw_result: str,
    max_tokens: int = 800
) -> str:
    """
    Truncate tool output before it enters conversation history.
    800 tokens is enough context for most downstream reasoning steps.
    Anything beyond that is usually noise, not signal.
    """
    # Rough estimate: 4 chars ≈ 1 token
    max_chars = max_tokens * 4
    if len(raw_result) <= max_chars:
        return raw_result
    truncated = raw_result[:max_chars]
    # Try to cut at a sentence boundary
    last_period = truncated.rfind('. ')
    if last_period > max_chars * 0.7:  # Only cut at sentence if it's not too short
        truncated = truncated[:last_period + 1]
    return truncated + f"\n[Result truncated at ~{max_tokens} tokens. Full result available on request.]"


# Usage inside a tool execution handler
def execute_tool(tool_name: str, tool_args: dict) -> str:
    # dispatch_tool is assumed to be your own tool router, defined elsewhere
    raw = dispatch_tool(tool_name, tool_args)
    # Different truncation limits per tool type
    limits = {
        "search_web": 600,
        "get_document": 1000,
        "run_sql_query": 400,
        "search_patient_records": 500,
    }
    limit = limits.get(tool_name, 800)
    return truncate_tool_result(tool_name, raw, max_tokens=limit)

On a multi-turn research agent, this single change reduced my average input token count per turn by 35% after the third tool call in a session.

Where These Techniques Break Down

None of these are free. Each comes with a specific failure mode worth knowing before deploying.

Tool schema compression fails when descriptions are so terse the model makes wrong tool-selection decisions. The threshold I’ve found: as long as you keep a clear verb and the primary parameter described, accuracy holds. Going below one sentence per tool tends to degrade routing.

Dynamic tool loading fails when the intent classifier makes wrong calls. If your keyword heuristic routes a billing question to the patient intent, the model will hallucinate tools it can’t actually call. Add a fallback: if the model requests a tool not in the active set, inject it on the next turn and retry rather than letting the error cascade.
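The fallback described above can be sketched as a small helper. This is a minimal illustration under assumptions: the helper name is hypothetical, tool definitions are plain dicts with a "name" key, and the requested tool name is whatever your code reads off the model's tool call.

```python
def expand_tools_on_miss(
    requested_tool: str,
    active_tools: list[dict],
    all_tools: list[dict],
) -> tuple[list[dict], bool]:
    """Return (tools for the next call, whether a retry is needed)."""
    if any(t["name"] == requested_tool for t in active_tools):
        return active_tools, False  # tool was available; no retry
    for tool in all_tools:
        if tool["name"] == requested_tool:
            return active_tools + [tool], True  # inject it and retry
    return active_tools, False  # unknown everywhere: nothing to inject


all_tools = [{"name": "get_invoice"}, {"name": "get_lab_results"}]
active = [{"name": "get_lab_results"}]  # classifier guessed "patient"

tools, retry = expand_tools_on_miss("get_invoice", active, all_tools)
print([t["name"] for t in tools], retry)
# ['get_lab_results', 'get_invoice'] True
```

Capping this at one retry per turn keeps a persistent misclassification from looping.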

Context summarisation fails on tasks that require exact reproduction of previous turns — legal or compliance workflows, for example, where the precise wording of a prior message matters. In those cases, do not compress: use full history and pay the cost, or store the verbatim transcript separately and retrieve it on demand.

Thinking budget control is wrong for novel problems. Capping a complex reasoning task at the minimum budget produces worse answers, not cheaper ones, because the model makes more errors and requires follow-up turns to correct them. The net cost is higher, not lower. Profile which query types actually benefit from extended reasoning before capping.

Where to Start

Under 5 minutes: Open your current agent’s tool definitions and count the tokens in the description fields alone. If any description is longer than two sentences, rewrite it in one. Measure the before/after token count. The number will surprise you.

One afternoon: Implement dynamic tool loading with a keyword-based intent classifier. Start with three intent buckets and no ML required. Measure token spend per turn before and after on your last 50 real conversations.

Production-grade: Add max_tokens ceilings to every LLM call you make — routing decisions, summarisations, structured outputs, and main agent turns. Then profile which query types are triggering extended thinking unnecessarily and add a complexity classifier. The billing drop tends to happen fast, usually within the first week of production traffic.
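Each of these steps depends on a before/after measurement. The sketch below assumes you already log the input and output token counts the API reports for each call (`response.usage` in the Anthropic SDK); the dataclass and helper names are illustrative, and the sample numbers are made up for the demonstration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TurnUsage:
    input_tokens: int
    output_tokens: int


def summarise_usage(turns: list[TurnUsage]) -> dict[str, float]:
    """Average tokens per turn; raises StatisticsError on an empty list."""
    return {
        "avg_input": mean(t.input_tokens for t in turns),
        "avg_output": mean(t.output_tokens for t in turns),
    }


def percent_change(before: float, after: float) -> float:
    """Negative values mean a reduction."""
    return (after - before) / before * 100


# Made-up sample logs, before and after an optimisation
before = summarise_usage([TurnUsage(9000, 300), TurnUsage(11000, 500)])
after = summarise_usage([TurnUsage(4000, 100), TurnUsage(6000, 100)])
print(percent_change(before["avg_input"], after["avg_input"]))  # -50.0
```

Comparing the same set of real conversations on both sides keeps a change in traffic mix from masquerading as a saving.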

The expensive parts of an agentic system are almost never where the first optimisation pass looks.

References

[1] Silfverskiöld, I. (May 2026). Agentic AI: How to Save on Tokens. Data Science Collective, Medium. https://medium.com/data-science-collective/agentic-ai-how-to-save-on-tokens-9a1571ac6c85

[2] Anthropic (2025). Claude Model Pricing. https://www.anthropic.com/pricing

[3] Anthropic (2025). Extended Thinking: Budget Tokens. Anthropic API Documentation. https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking

[4] Marie, B. (December 2025). Efficient LLMs at Scale: My NeurIPS Week in KV Caches, Spec Decoding, and FP4. The Kaitchup. https://kaitchup.substack.com/p/efficient-llms-at-scale-my-neurips

[5] Shaikh, S. (August 2025). Advanced Structured Outputs & Tools. Medium. https://medium.com/@sadikkhadeer/advanced-structured-outputs-tools-a99d44685b73
