Your Agentic AI Bill Is a Prompt Engineering Problem in Disguise

Last Updated on May 27, 2026 by Editorial Team

Author(s): Darshandagaa

Originally published on Towards AI.

An unoptimised agent handling 100 messages a day at 166K input tokens per message costs around $2,490 a month on Claude Opus 4.6. [1] That number is not a warning label. It is a real billing scenario I watched unfold on a healthcare pipeline I helped build at my current firm.

The pipeline had 14 tools registered. Every turn sent all 14 tool schemas to the model — regardless of whether the current query had anything to do with 11 of them. The system prompt was 3,800 tokens of carefully written context that the model re-read on every single message. And the outputs were verbose: paragraph-form reasoning for decisions that needed a single word.

I had tuned the retrieval logic for weeks. The evals looked clean. The bill looked insane.

Token spend in production agents is almost never an embeddings or retrieval problem. It is a prompt architecture problem — and the expensive parts are hiding in places most people never look.

TL;DR: Your agent’s token bill is dominated by tool schemas, verbose outputs, and redundant context — not retrieval. Most caching guides miss these. This article covers six techniques that go deeper, with code for each.

The Part Everyone Skips: Where the Tokens Actually Go

Prompt caching, lazy-loading tools, and sub-agent delegation are well-covered. [1] They are correct and worth implementing. But they address the symptoms, not the source.

Think of a bloated agent context like a restaurant kitchen that restocks every prep station fully before every single order — including the pastry station during a lunch rush that has never once served dessert.

That is what sending all tool schemas on every turn looks like from the billing side.

There are three token sinks that rarely get named directly:

The tool schema layer sends the full JSON definition of every registered tool to the model on every request. A single well-described tool with parameter documentation can run 200–400 tokens. With 15 tools, that is 3,000–6,000 tokens before the user has said a single word, and you are billed for every one of them.

The output verbosity problem is the inverse. Most agents are prompted to explain their reasoning, which is correct for debugging and transparency, but expensive in production. A routing decision that is 3 tokens in the right format costs 150 tokens when the model writes a paragraph.

The static context problem is the most expensive of all. A system prompt that works correctly does not need to be re-read in full on every turn. Most agents never compress, summarise, or selectively load it.
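The arithmetic behind these numbers is easy to reproduce. The sketch below is a back-of-envelope calculator, not a billing tool: the function name is hypothetical, the 30-day month is an assumption, and the $5 per million input tokens is simply the rate implied by the $2,490 figure quoted earlier, so check current pricing before relying on it.

```python
def monthly_input_cost(
    messages_per_day: int,
    input_tokens_per_message: int,
    usd_per_million_input_tokens: float,
    days: int = 30,  # assumption: a 30-day billing month
) -> float:
    """Monthly spend on input tokens alone, in USD."""
    total_tokens = messages_per_day * input_tokens_per_message * days
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


# Assumption: the rate implied by the article's $2,490/month figure.
ASSUMED_USD_PER_MTOK = 5.0

# 100 messages/day at 166K input tokens each
baseline = monthly_input_cost(100, 166_000, ASSUMED_USD_PER_MTOK)

# Share of that bill attributable to 15 tool schemas at 200-400 tokens each
schema_low = monthly_input_cost(100, 15 * 200, ASSUMED_USD_PER_MTOK)
schema_high = monthly_input_cost(100, 15 * 400, ASSUMED_USD_PER_MTOK)

print(f"Baseline input cost: ${baseline:,.2f}/month")
print(f"Tool schema overhead: ${schema_low:,.2f} to ${schema_high:,.2f}/month")
```

Swapping in your own message volume and per-message token count shows quickly which of the three sinks is worth attacking first.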

Six Things to Actually Do

1. Tool schema surgery

Every parameter description you write goes into the input token count. Most tool schemas are written for human readability, not token efficiency. The fix is not to remove descriptions — it is to write them like API docs, not prose.

# Before: 187 tokens just for this tool definition
# Note: Anthropic's Messages API names the schema key "input_schema" rather than
# "parameters"; rename it if you pass these dicts to client.messages.create().
tools_verbose = [{
    "name": "search_patient_records",
    "description": "This tool allows you to search through the patient medical records database. You can use it to find information about a specific patient, including their medical history, current medications, allergies, and recent lab results. The search is performed using the patient's ID number which is a unique identifier assigned to each patient in the system.",
    "parameters": {
        "type": "object",
        "properties": {
            "patient_id": {
                "type": "string",
                "description": "The unique patient identifier number that you want to search for. This should be provided as a string in the format P followed by digits, for example P12345."
            },
            "fields": {
                "type": "array",
                "items": {"type": "string"},
                "description": "An optional list of specific fields that you would like to retrieve from the patient record. If you don't provide this parameter, all fields will be returned by default."
            }
        },
        "required": ["patient_id"]
    }
}]

# After: 61 tokens, same functional information
tools_lean = [{
    "name": "search_patient_records",
    "description": "Search patient DB by ID. Returns medical history, meds, allergies, labs.",
    "parameters": {
        "type": "object",
        "properties": {
            "patient_id": {"type": "string", "description": "Format: P{digits}, e.g. P12345"},
            "fields": {"type": "array", "items": {"type": "string"}, "description": "Optional subset of fields. Default: all."}
        },
        "required": ["patient_id"]
    }
}]

That is a 67% reduction on one tool. Across 15 tools, that delta compounds on every single turn.

2. Dynamic tool loading by intent

Do not send tools the model does not need. Gate tool injection on a cheap intent classifier that runs before the main LLM call.

import re

from anthropic import Anthropic
from anthropic.types import Message

client = Anthropic()

# Placeholders: supply your own full tool schemas and system prompt.
ALL_TOOL_DEFINITIONS: list[dict] = []
SYSTEM_PROMPT = "..."

# Full tool registry: never sent all at once
TOOL_REGISTRY = {
    "patient": ["search_patient_records", "update_medication", "get_lab_results"],
    "scheduling": ["book_appointment", "cancel_appointment", "check_availability"],
    "billing": ["get_invoice", "process_refund", "check_insurance"],
    "general": ["answer_question"]  # fallback
}

def classify_intent(user_message: str) -> str:
    """Fast, cheap classification using a small/cheap model or keyword heuristic."""
    keywords = {
        "patient": ["patient", "medication", "lab", "allergy", "history", "record"],
        "scheduling": ["appointment", "book", "schedule", "cancel", "available"],
        "billing": ["invoice", "refund", "insurance", "payment", "bill"]
    }
    msg_lower = user_message.lower()
    for intent, kwords in keywords.items():
        # Match at word starts only, so "lab" does not fire on "available"
        if any(re.search(rf"\b{re.escape(k)}", msg_lower) for k in kwords):
            return intent
    return "general"

def get_tools_for_intent(intent: str) -> list:
    """Return only the tool definitions relevant to this intent."""
    tool_names = TOOL_REGISTRY.get(intent, TOOL_REGISTRY["general"])
    return [t for t in ALL_TOOL_DEFINITIONS if t["name"] in tool_names]

def run_agent_turn(user_message: str, conversation_history: list) -> Message:
    intent = classify_intent(user_message)
    active_tools = get_tools_for_intent(intent)

    # Only 3–4 tools sent per turn instead of 15
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        tools=active_tools,
        messages=conversation_history + [{"role": "user", "content": user_message}]
    )
    # Returns the full Message so the caller can inspect tool_use blocks
    return response

On a mixed-intent conversation, this reduces tool schema tokens by 70–80% per turn. Implemented as a keyword heuristic, the intent classifier costs no LLM tokens at all; a tiny embedding-based router costs only the tokens needed to embed the user message.

3. Output length contracts via `max_tokens` + structured outputs

The model will write as much as it is allowed to. If you do not set an explicit ceiling and a format constraint, it will fill the space.

The technique is to combine max_tokens with a structured output schema. When the model is constrained to emit a specific JSON shape, it stops generating the moment the shape is complete.

from pydantic import BaseModel
from typing import Literal
import anthropic
import json

class AgentDecision(BaseModel):
    action: Literal["retrieve", "answer", "escalate", "clarify"]
    tool_name: str | None = None
    tool_args: dict | None = None
    confidence: Literal["high", "medium", "low"]
    # No "reasoning" field: that 200-token explanation lives in the trace, not the output

DECISION_SCHEMA = AgentDecision.model_json_schema()

def get_routing_decision(user_message: str, context: str) -> AgentDecision:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-haiku-4-5",  # Haiku for routing decisions: cheaper per token than Sonnet
        max_tokens=150,  # Hard ceiling; the schema completes in ~60 tokens
        system=f"""You are a routing agent. Respond ONLY with valid JSON matching this schema:
{json.dumps(DECISION_SCHEMA, indent=2)}
Do not add explanation. Do not add keys not in the schema.""",
        messages=[{
            "role": "user",
            "content": f"Context: {context}\n\nUser: {user_message}"
        }]
    )

    # Raises pydantic.ValidationError if the model returns anything but schema-valid JSON
    raw = response.content[0].text
    return AgentDecision.model_validate_json(raw)

Two optimisations compound here. Using Claude Haiku instead of Sonnet for routing decisions cuts per-token cost by roughly 3x at list prices. [2] Setting max_tokens=150 prevents the model from elaborating. Together they drop routing call cost from ~$0.004 to ~$0.0002.

4. Context summarisation at conversation boundaries

Long conversations accumulate context that the model pays to re-read on every turn. The standard solution is a rolling summary that replaces old turns once the conversation exceeds a token threshold.

import anthropic

def maybe_compress_history(
    messages: list[dict],
    token_threshold: int = 8000,
    keep_recent_turns: int = 4
) -> list[dict]:
    """
    If conversation exceeds threshold, summarise all but the most recent turns.
    Preserves the last N turns verbatim for continuity.
    """
    # Rough token estimate: 4 chars ≈ 1 token
    total_chars = sum(len(str(m)) for m in messages)
    estimated_tokens = total_chars // 4
    if estimated_tokens < token_threshold:
        return messages

    # Split: compress old history, keep recent turns intact
    to_compress = messages[:-keep_recent_turns]
    to_keep = messages[-keep_recent_turns:]
    if not to_compress:
        return messages

    # Note: messages whose content is not a plain string (for example tool_use or
    # tool_result blocks) are skipped here and will not appear in the summary.
    history_text = "\n".join([
        f"{m['role'].upper()}: {m['content']}"
        for m in to_compress
        if isinstance(m.get('content'), str)
    ])

    client = anthropic.Anthropic()
    summary_response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Summarise this conversation history in under 250 words. "
                       f"Preserve: decisions made, facts established, tools called, user intent. "
                       f"Discard: pleasantries, repeated context, verbose reasoning.\n\n{history_text}"
        }]
    )
    summary = summary_response.content[0].text

    # Replace the old turns with a summary exchange, then append the recent turns
    compressed_history = [
        {"role": "user", "content": f"[CONVERSATION SUMMARY]\n{summary}"},
        {"role": "assistant", "content": "Understood. Continuing from the summary above."}
    ] + to_keep
    return compressed_history

On a 30-turn conversation, this typically cuts the input token count for turn 31 onwards by 60–75%.

5. Thinking budget control (extended reasoning models)

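Claude’s extended thinking mode, available on models such as Sonnet 4.5 and Opus 4.5, exposes a budget_tokens parameter that caps how many tokens the model spends on internal reasoning before answering. [3] Many agents enable thinking with one large fixed budget for every call, which means the model can reason at length on questions that do not require it. The API enforces a minimum budget of 1,024 tokens, so for trivial lookups leaving thinking disabled is cheaper still.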

from typing import Literal

import anthropic

def call_with_thinking_budget(
    prompt: str,
    task_complexity: Literal["simple", "moderate", "complex"]
) -> str:
    """
    Match thinking budget to actual task complexity.
    Simple factual lookups do not need 10,000 tokens of internal reasoning.
    """
    budgets = {
        "simple": 1024,    # API minimum for budget_tokens; basic factual retrieval, classification
        "moderate": 2000,  # Multi-step reasoning, plan generation
        "complex": 8000    # Novel problem solving, long-horizon planning
    }
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=budgets[task_complexity] + 1024,  # budget + output headroom (must exceed budget_tokens)
        thinking={
            "type": "enabled",
            "budget_tokens": budgets[task_complexity]
        },
        messages=[{"role": "user", "content": prompt}]
    )
    # Return only the text response, not the thinking block
    for block in response.content:
        if block.type == "text":
            return block.text
    return ""

Routing simple queries to the smallest thinking budget instead of one large fixed budget can cut per-call cost for those queries by 80–90%, usually without changing the answer.
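The function above takes the complexity label as given. One hypothetical way to produce it without an extra model call is a word-level heuristic like the sketch below. The cue words and length thresholds are illustrative assumptions, not values from the pipeline described here, and should be tuned against real traffic.

```python
import re
from typing import Literal

Complexity = Literal["simple", "moderate", "complex"]

# Hypothetical cue words; replace with terms drawn from your own query logs.
COMPLEX_CUES = {"design", "plan", "prove", "debug", "diagnose"}
MODERATE_CUES = {"compare", "summarise", "summarize", "explain", "steps"}


def estimate_complexity(prompt: str) -> Complexity:
    """Label a prompt by whole-word cues first, then by length."""
    words = re.findall(r"[a-z]+", prompt.lower())
    vocabulary = set(words)
    # Whole-word matching avoids false hits such as "plan" inside "explanation".
    if vocabulary & COMPLEX_CUES or len(words) > 150:
        return "complex"
    if vocabulary & MODERATE_CUES or len(words) > 40:
        return "moderate"
    return "simple"


print(estimate_complexity("What is the patient's blood type?"))
print(estimate_complexity("Compare these two medication schedules"))
print(estimate_complexity("Design a migration plan for the billing system"))
```

The result can be passed straight in as task_complexity. A misclassification here costs either money (too high) or answer quality (too low), so it is worth logging the label next to each call.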

6. Tool result truncation before re-injection

When a tool returns a result, that result gets appended to the conversation and re-read on every subsequent turn. A web search that returns 4,000 tokens of content will be in the context for the rest of the session.

Truncate tool results before they enter the conversation history.

def truncate_tool_result(
    tool_name: str,
    raw_result: str,
    max_tokens: int = 800
) -> str:
    """
    Truncate tool output before it enters conversation history.
    800 tokens is enough context for most downstream reasoning steps.
    Anything beyond that is usually noise, not signal.
    """
    # Rough estimate: 4 chars ≈ 1 token
    max_chars = max_tokens * 4
    if len(raw_result) <= max_chars:
        return raw_result
    truncated = raw_result[:max_chars]
    # Try to cut at a sentence boundary
    last_period = truncated.rfind('. ')
    if last_period > max_chars * 0.7:  # Only cut at sentence if it's not too short
        truncated = truncated[:last_period + 1]
    return truncated + f"\n[Result truncated at ~{max_tokens} tokens. Full result available on request.]"

# Usage inside a tool execution handler
# (dispatch_tool is your own function that actually runs the tool)
def execute_tool(tool_name: str, tool_args: dict) -> str:
    raw = dispatch_tool(tool_name, tool_args)
    # Different truncation limits per tool type
    limits = {
        "search_web": 600,
        "get_document": 1000,
        "run_sql_query": 400,
        "search_patient_records": 500,
    }
    limit = limits.get(tool_name, 800)
    return truncate_tool_result(tool_name, raw, max_tokens=limit)

On a multi-turn research agent, this single change reduced my average input token count per turn by 35% after the third tool call in a session.
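A rough way to see why one oversized result matters is to multiply its size by the number of turns it stays in context. The sketch below does only that. The function name is hypothetical, the ten remaining turns are an assumed session length, and it ignores prompt caching, which would lower the billed cost of re-read tokens.

```python
def cumulative_reread_tokens(result_tokens: int, remaining_turns: int) -> int:
    """Input tokens billed for one tool result that stays in context for N more turns."""
    return result_tokens * remaining_turns


# Assumption: the session runs for 10 more turns after the tool call.
REMAINING_TURNS = 10

full = cumulative_reread_tokens(4_000, REMAINING_TURNS)    # untruncated web search result
trimmed = cumulative_reread_tokens(800, REMAINING_TURNS)   # truncated to the 800-token default
saved = full - trimmed

print(f"Untruncated: {full:,} tokens re-read")
print(f"Truncated:   {trimmed:,} tokens re-read")
print(f"Saved:       {saved:,} tokens from one tool call")
```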

Where These Techniques Break Down

None of these are free. Each comes with a specific failure mode worth knowing before deploying.

Tool schema compression fails when descriptions are so terse the model makes wrong tool-selection decisions. The threshold I’ve found: as long as you keep a clear verb and the primary parameter described, accuracy holds. Going below one sentence per tool tends to degrade routing.

Dynamic tool loading fails when the intent classifier makes wrong calls. If your keyword heuristic routes a billing question to the patient intent, the model will hallucinate tools it can’t actually call. Add a fallback: if the model requests a tool not in the active set, inject it on the next turn and retry rather than letting the error cascade.

Context summarisation fails on tasks that require exact reproduction of previous turns — legal or compliance workflows, for example, where the precise wording of a prior message matters. In those cases, do not compress: use full history and pay the cost, or store the verbatim transcript separately and retrieve it on demand.

Thinking budget control is wrong for novel problems. Capping a complex reasoning task at the smallest budget produces worse answers, not cheaper ones, because the model makes more errors and requires follow-up turns to correct them. The net cost is higher, not lower. Profile which query types actually benefit from extended reasoning before capping.
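The fallback for a misrouted intent can be sketched as a small pure function. How you detect the missing tool name depends on your framework (parsing the model's text, a dedicated "request_tool" meta-tool, or an error from your dispatcher), so the sketch takes that name as an input. The function name and return shape are hypothetical.

```python
def tools_for_retry(
    requested_name: str,
    active_tools: list[dict],
    all_tools: list[dict],
) -> tuple[list[dict], bool]:
    """Return (tool list for the next call, whether a retry is worthwhile)."""
    active_names = {tool["name"] for tool in active_tools}
    if requested_name in active_names:
        # The tool was already available; the failure lies elsewhere.
        return active_tools, False
    for tool in all_tools:
        if tool["name"] == requested_name:
            # Inject only the missing tool rather than the whole registry.
            return active_tools + [tool], True
    # The name is not registered anywhere, so retrying would not help.
    return active_tools, False


registry = [{"name": "get_invoice"}, {"name": "book_appointment"}]
active = [{"name": "book_appointment"}]
expanded, should_retry = tools_for_retry("get_invoice", active, registry)
print([tool["name"] for tool in expanded], should_retry)
```

Capping retries at one per turn keeps a bad classifier from turning a single misroute into a loop.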

Where to Start

Under 5 minutes: Open your current agent’s tool definitions and count the tokens in the description fields alone. If any description is longer than two sentences, rewrite it in one. Measure the before/after token count. The number will surprise you.

One afternoon: Implement dynamic tool loading with a keyword-based intent classifier. Start with three intent buckets and no ML required. Measure token spend per turn before and after on your last 50 real conversations.

Production-grade: Add max_tokens ceilings to every LLM call you make — routing decisions, summarisations, structured outputs, and main agent turns. Then profile which query types are triggering extended thinking unnecessarily and add a complexity classifier. The billing drop tends to happen fast, usually within the first week of production traffic.
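The five-minute audit can be scripted. The sketch below walks a tool definition and sums an estimated token count for every "description" string, using the same rough 4-characters-per-token rule as the code earlier in this article. The function names are hypothetical, and the estimate is only good for before/after comparison, not for billing.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: 4 characters is about 1 token."""
    return len(text) // 4


def description_tokens(schema) -> int:
    """Recursively sum estimated tokens across every "description" string."""
    if isinstance(schema, dict):
        total = 0
        for key, value in schema.items():
            if key == "description" and isinstance(value, str):
                total += estimate_tokens(value)
            else:
                # Also covers a parameter that happens to be named "description".
                total += description_tokens(value)
        return total
    if isinstance(schema, list):
        return sum(description_tokens(item) for item in schema)
    return 0


example_tool = {
    "name": "search_patient_records",
    "description": "Search patient DB by ID. Returns medical history, meds, allergies, labs.",
    "parameters": {
        "type": "object",
        "properties": {
            "patient_id": {"type": "string", "description": "Format: P{digits}, e.g. P12345"},
        },
        "required": ["patient_id"],
    },
}

print(f"Estimated description tokens: {description_tokens(example_tool)}")
```

Running it over the verbose and lean versions of each tool gives a per-tool delta you can sort by, which shows where the rewrite effort pays off first.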

The expensive parts of an agentic system are almost never where the first optimisation pass looks.

References

[1] Silfverskiöld, I. (May 2026). Agentic AI: How to Save on Tokens. Data Science Collective, Medium. https://medium.com/data-science-collective/agentic-ai-how-to-save-on-tokens-9a1571ac6c85

[2] Anthropic (2025). Claude Model Pricing. https://www.anthropic.com/pricing

[3] Anthropic (2025). Extended Thinking: Budget Tokens. Anthropic API Documentation. https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking

[4] Marie, B. (December 2025). Efficient LLMs at Scale: My NeurIPS Week in KV Caches, Spec Decoding, and FP4. The Kaitchup. https://kaitchup.substack.com/p/efficient-llms-at-scale-my-neurips

[5] Shaikh, S. (August 2025). Advanced Structured Outputs & Tools. Medium. https://medium.com/@sadikkhadeer/advanced-structured-outputs-tools-a99d44685b73
