Beyond the Prompt: Why Autonomous AI Agents Are Replacing the Chatbot
Last Updated on June 8, 2026 by Editorial Team
Author(s): Suchit Majumdar
Originally published on Towards AI.
In May 2025, Sebastian Siemiatkowski — the same Klarna CEO who fifteen months earlier had told the world that one OpenAI-powered assistant was doing the work of 700 customer service agents — quietly started hiring humans back. Bloomberg got the quote: “Cost unfortunately seems to have been a too predominant evaluation factor, what you end up having is lower quality.” Headcount over the same window went from 5,527 at the end of 2022 to 3,422 at the end of 2024, per the S-1 Klarna filed in November. The chatbot stayed. The “all-AI customer service” story did not.
So the title of this piece is half a lie, and I want to correct it before you read another paragraph. Chatbots are not, in any general sense, being replaced by autonomous agents in 2026. The replacement is happening in one specific place: queue-shaped back-office work where no human is waiting on the other end, and almost nowhere else. That narrow claim is the thesis. The broad version is what every vendor deck says, and it is wrong. If you walked out of your last AI strategy review thinking the agent wave is about to subsume your support org, your sales org, and your engineering org all at once, you are about to spend the next four quarters defending a budget against numbers that will not arrive.
That is the claim. The rest is me showing my work.
Klarna is evidence for the thesis, in reverse
The 2024 Klarna press release is worth re-reading with an engineer’s eye. 2.3 million conversations in month one across 35 languages. Resolution time from 11 minutes down to 2. A CSAT of 4.4 against a human baseline of 4.2, Klarna’s own number, never independently audited. OpenAI mirrored the case study on its own site. It was the most widely cited “AI replaced humans” deployment of the LLM era.
It was also a chatbot. Not an agent. A user-initiated, real-time, conversational interface with safety rails and a handoff-to-human button. Gergely Orosz pointed this out at the time in his Pragmatic Engineer breakdown: what Klarna had actually built was tier-one (L1) support automation, the kind of containment work IVR systems were doing twenty years ago, except now in natural language. The bot was a filter that escalated anything sharp.
Then it broke on the seams chatbots always break on. The May 2025 reporting from CX Dive and CNBC converges on a single picture: hallucinations clustered on edge cases. CSAT cratered on emotional tickets where the bot was technically correct but tonally wrong, because being right and being heard are different jobs. Compliance teams refused to let an LLM autonomously close accounts. So Klarna kept the bot for volume and rebuilt the human layer underneath it, “Uber-style,” remote and flexible, hiring students and rural workers as on-demand specialists.
Read that as a bull case for chatbots if you want. I read it as a warning about the entire customer-facing slice. The most aggressive chatbot deployment in the world, with founder-level air cover and a workforce reduction of more than 2,000 people, still bounced off the part of the work where a customer was on the line and cared about being there. That isn’t a story about agents replacing chatbots. It’s a story about customer-facing conversation being a category that resists full automation by either shape of system.

Where the chatbot still wins, and it isn’t close
Intercom Fin is the cleanest counter to the “agents will eat customer support” narrative. Self-reported resolution rate of 67% globally as of late 2025, on 40 million cumulative conversations, across more than 10,000 business accounts. Priced at $0.99 per resolved conversation. Intercom claims the human-agent comparison is $5 to $10 per query and I’ll flag that as a vendor-published number, not an audit — but Teneo’s 2025 cost analysis lands in roughly the same range ($8–$15 per fully-loaded human resolution), so the order of magnitude is real even if Intercom is choosing the friendly end.
The caveats matter. “Resolution” is defined by Intercom: the customer exits, or affirms satisfaction, after Fin’s last answer. No public study correlates that signal with actual customer satisfaction. And the variance across accounts is enormous. One Intercom community thread in late 2025 had a customer reporting 27.6% resolution rate next to another at 80.1% over the same 12-week window, with the high performers being the ones who spent two to four weeks cleaning their knowledge base before launch. The published 67% is a marketing mean sitting on a long, ugly tail.
But the unit economics survive every caveat. This is a working chatbot business, at scale, on user-initiated conversational work, with no agent loop in sight. If your Q3 roadmap involves wrapping Fin in a LangGraph orchestrator and rebranding it an “agentic support platform,” the question I would ask in your planning meeting is whether the additional dollars per resolution clear the additional tokens per resolution, because the LeanOps numbers I’ll get to below say they usually don’t. There’s also the Air Canada precedent from February 2024, when the BC Civil Resolution Tribunal made the airline liable for its chatbot’s incorrect bereavement-fare advice. The damages were small, roughly $650 CAD. The precedent is not. Any system, conversational or autonomous, that makes binding statements to a customer creates legal exposure, which is one more structural reason the production migration is happening where no customer sits on the other end of the conversation at all.
What actually has to be true for an agent to pay for itself
Strip away the framework news cycle. OpenAI Agents SDK in March 2025. Google ADK in April. LangGraph 1.0 in October. Anthropic computer use in beta since October 2024 and Microsoft Copilot Studio’s computer-use agents going GA in May 2026. None of that determines whether a production agent earns its keep. What determines it, in every deployment I’ve been able to study at depth, is a set of conditions about the work itself, and I want to walk through them out of order, the way they actually surface when you’re staring at a roadmap.
Start with the deepest one: per-task economic value. This is what most teams underprice. Microsoft Research published “How Do AI Agents Spend Your Money” in April 2026 and the headline finding was that agentic coding tasks consume “1000x more tokens than code reasoning and code chat.” That’s a useful shock number, but the LeanOps breakdown from May 2026 is the one I keep pasting into planning docs. A 5-step Claude Sonnet 4.6 agent loop runs about $0.158 against $0.049 for the same task as a single chat call. So a 3.2x multiplier at five steps. By 50 steps the multiplier exceeds 30x; by 200, which LeanOps describes as a typical debugging session, it exceeds 100x. If the task is worth $0.40 to your business, an agent is structurally unaffordable no matter how good the model gets. If the task is a claim payout, a KYC review, or an AP invoice — work worth $40 to the business at minimum — the math reverses. The Microsoft prompt-length data tells the same story from a different angle: average prompt length grew roughly fourfold from 2024 to 2025, from about 1,500 tokens to over 6,000, almost entirely because of agentic and reasoning workflows. The token bill is not coming down faster than the loops are getting longer.
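To make the per-task arithmetic concrete, here is a minimal sketch that reuses only the figures quoted above: $0.049 for a single chat call, $0.158 for a 5-step loop, and the "more than 30x" and "more than 100x" multipliers treated as lower bounds. The function names are mine, and the comparison covers token cost only. It ignores retries, human review, and error handling, so treat it as a floor on agent cost rather than a full cost model.

```python
# Figures quoted in the text (LeanOps, May 2026). Everything else is illustrative.
CHAT_CALL_COST = 0.049       # single chat call, USD
FIVE_STEP_LOOP_COST = 0.158  # 5-step agent loop, USD


def agent_cost(multiplier: float, chat_cost: float = CHAT_CALL_COST) -> float:
    """Token cost of one agent run, expressed as a multiple of the single-call cost."""
    return multiplier * chat_cost


def pays_for_itself(task_value: float, multiplier: float) -> bool:
    """True when the task's value to the business exceeds the agent's token cost."""
    return task_value > agent_cost(multiplier)


five_step_multiplier = FIVE_STEP_LOOP_COST / CHAT_CALL_COST  # roughly 3.2x

# The 50-step and 200-step multipliers are the lower bounds quoted in the text.
for steps, multiplier in [(5, five_step_multiplier), (50, 30.0), (200, 100.0)]:
    cost = agent_cost(multiplier)
    print(
        f"{steps:>3} steps: at least ${cost:.2f} per run | "
        f"$0.40 task clears: {pays_for_itself(0.40, multiplier)} | "
        f"$40 task clears: {pays_for_itself(40.0, multiplier)}"
    )
```

On token cost alone, the $0.40 task still clears a five-step loop but is underwater by 50 steps, while the $40 task clears even the 200-step case with room to absorb review and retry costs.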

The second condition is that the work has to be queue-initiated rather than user-initiated. Claims arrive. Invoices arrive. KYC review flags arrive. Nobody on the other end is staring at a loading spinner waiting on the agent’s 12th tool call, which means your latency budget is measured in minutes or hours and you can afford a long loop with retries. That single property is what makes the math from the paragraph above survive contact with production. Try the same loop on a customer-facing chat session and you ship a worse Klarna.
The other two conditions are less interesting individually but they govern whether the workflow is actually buildable. You need a success signal that’s deterministic or close to it. Did the claim get paid. Did the invoice match a PO within tolerance. Did the ticket stay closed for fourteen days. You can write a unit test for “the work got done,” and that’s what lets you regression-test the agent on real traffic instead of guessing from CSAT noise. And you need a bounded tool surface — a claims agent reads a policy database, runs fraud models, calls a payment API, escalates on exception. Four tools, all owned, all logged. Compare that to a “general-purpose research agent” with the open web, a terminal, a code interpreter, and Slack. The first runs at per-step accuracy north of 95% in production. The second has an unbounded failure space, and the compounding math is unforgiving: 0.95 to the 20th power is roughly 36% end-to-end success, and that’s the optimistic version where errors aren’t correlated. Real agents have correlated errors. The number is worse.
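The compounding math is easy to check and worth turning around. The sketch below assumes independent per-step errors, which the paragraph above already flags as the optimistic case, and the helper names are illustrative rather than taken from any library.

```python
import math


def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent errors."""
    return per_step_accuracy ** steps


def required_per_step_accuracy(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach a target end-to-end success rate."""
    return target ** (1.0 / steps)


def max_steps(per_step_accuracy: float, target: float) -> int:
    """Longest loop that still meets the target at a given per-step accuracy."""
    return math.floor(math.log(target) / math.log(per_step_accuracy))


print(f"{end_to_end_success(0.95, 20):.1%}")          # the 20-step example from the text
print(f"{required_per_step_accuracy(0.80, 20):.2%}")  # per-step bar for 80% over 20 steps
print(max_steps(0.95, 0.80))                          # steps available at 95% per step
```

Read in reverse, the same formula says that a 95%-per-step agent has only about four steps to work with before it drops below an 80% end-to-end target, which is one way to see why a four-tool surface matters.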
Here is the whiteboard I’d draw in your Q3 planning meeting. Two axes. Queue-initiated versus user-initiated on one. Bounded tool surface versus open-ended on the other. The top-right quadrant, queue plus bounded, is where you build an agent and watch it earn its tokens. The bottom-left, conversational plus open-ended, is where you ship a chatbot if you ship anything. Everything else is the dead zone. MIT’s NANDA “GenAI Divide” report from August 2025 found that 95% of enterprise generative AI pilots failed to deliver measurable P&L impact. The report doesn’t draw my quadrant chart; that’s my read on the underlying data. But the report’s root-cause list (no defined measurable outcome before deployment, missing integration, weak governance) describes a portfolio of pilots that mostly lived in the dead zone, and the firms shipping in the queue-plus-bounded corner aren’t the ones in the 95%.
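For teams that prefer the whiteboard as a lookup, the two-axis chart reduces to a few lines. This is only a sketch of the quadrant logic described above. The enum and function names are hypothetical, and real workflows rarely sit cleanly on either axis.

```python
from enum import Enum


class Trigger(Enum):
    QUEUE = "queue-initiated"
    USER = "user-initiated"


class ToolSurface(Enum):
    BOUNDED = "bounded"
    OPEN_ENDED = "open-ended"


def recommend(trigger: Trigger, tools: ToolSurface) -> str:
    """Map a workflow's position on the two axes to the quadrant's recommendation."""
    if trigger is Trigger.QUEUE and tools is ToolSurface.BOUNDED:
        return "agent"      # queue plus bounded: build the agent
    if trigger is Trigger.USER and tools is ToolSurface.OPEN_ENDED:
        return "chatbot"    # conversational plus open-ended: chatbot, if anything
    return "dead zone"      # everything else


for trigger in Trigger:
    for tools in ToolSurface:
        print(f"{trigger.value:>15} + {tools.value:<10} -> {recommend(trigger, tools)}")
```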
What “queue-shaped” looks like in production
Lemonade is the case I would put in front of a board. As of December 31, 2024, AI Jim took 96% of first notices of loss without human involvement, and 55% of claims were fully automated from intake to payment. Record settlement: 2 seconds. That comes from the 10-K and the Q4 2024 shareholder letter, not a press release.
Watch what AI Jim is not doing. It is not having a conversation. It processes an intake form, runs a fraud model against a policy, and either issues a payment instruction or routes the claim to a specialist matched by skill, workload, and schedule. The customer experiences “2 seconds” because the architecture happens to be fast, but the architecture would still work if it took six minutes, because the customer’s expectation is “my claim got handled,” not “someone is replying to me right now.” Queue shape at the core. And here’s the honest hedge: Lemonade’s gross loss ratio sat in the 70s-to-80s range through 2024 and into 2025 per its 10-K filings, which is elevated, and which tells you AI Jim is doing real work on intake fraud and cycle time but is not (repeat, not) fixing underwriting. Anyone selling you AI claims as a loss-ratio miracle is selling you a different product than what Lemonade actually built.
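The queue shape described above can be sketched in a few lines. To be clear, this is not Lemonade's implementation, which is not public. The `Claim` fields, the `fraud_threshold` value, and the stub fraud model are all hypothetical, chosen only to show the intake, score, then pay-or-route pattern with no conversation anywhere in the loop.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Claim:
    claim_id: str
    amount: float
    policy_limit: float


def triage(
    claim: Claim,
    fraud_score: Callable[[Claim], float],
    fraud_threshold: float = 0.2,  # hypothetical cutoff, not a published figure
) -> str:
    """Return 'pay' for clean in-policy claims, otherwise route to a human."""
    if claim.amount > claim.policy_limit:
        return "route_to_specialist"
    if fraud_score(claim) >= fraud_threshold:
        return "route_to_specialist"
    return "pay"


def drain(queue: Iterable[Claim], fraud_score: Callable[[Claim], float]) -> dict:
    """Process every queued claim; nobody is waiting on a reply while this runs."""
    return {claim.claim_id: triage(claim, fraud_score) for claim in queue}


def stub_fraud_model(claim: Claim) -> float:
    """Stand-in for a real fraud model: flags claims close to the policy limit."""
    return 0.9 if claim.amount > 0.9 * claim.policy_limit else 0.05


queue = [
    Claim("c-1", 400.0, 1000.0),
    Claim("c-2", 5000.0, 1000.0),
    Claim("c-3", 950.0, 1000.0),
]
print(drain(queue, stub_fraud_model))
```

The point of the shape is that every branch ends in a logged decision, so both the success signal and the escalation path can be tested without a customer in the loop.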
The IT-ticket case data tells the same story with less drama. Moveworks at Hearst: 57% of support issues resolved autonomously, 300 account unlocks per month, 1,200 questions answered. Mercari: 74%. ServiceNow Now Assist hit $600 million ARR by Q4 2025, projected past $1 billion by end of FY2026, the fastest-growing line in ServiceNow’s portfolio. These are vendor-reported numbers. I could not find an independent audit on any of them, and I’d be skeptical of the decimal places, but the pattern is consistent enough across deployments that I trust the shape.
On the legal side, EvenUp drafts personal-injury demand letters at roughly $50 per letter against $200–$500 of paralegal time, processed over 100,000 cases in 2025, and closed a $150M Series E in October. AP invoice processing is the most boring example and possibly the most important one: best-in-class teams hit 52.8% touchless processing in 2025 (up from 47.2% the year before), Deloitte’s Basware deployment reportedly reached 89%, and cost per invoice drops from somewhere between $12.88 and $19.83 manually to about $2.36 electronically. None of these workflows involve anyone talking to anything.
KYC is where I’d plant a flag and refuse to walk further. Alloy launched perpetual KYC with Parcha and Greenlite in September 2025, and the MANTL-Alloy partnership processed 2 million applications by December. But I could not find audited straight-through-processing numbers. Every announcement says “increased” without a baseline. Treat the category as believed-promising-but-unverified. It fits all four conditions on paper. The published evidence has not caught up.
The coding-agent counter-case, honestly
The strongest objection to everything above is that coding agents are scaling faster than my thesis allows for. The objection has teeth and I want to address it directly rather than wave it away. Cognition went from $1M ARR in September 2024 to a $492M ARR run rate by May 2026, and just raised $1B at a $26B valuation. Devin’s 2025 review reports PR merge rates moving from 34% to 67% year over year, security fixes at 1.5 minutes against 30 for humans, Java migrations at 14x. Cursor claims 35% of its own merged PRs come from background agents. LangGraph reports 400+ production deployments across Cisco, Uber, LinkedIn, BlackRock, JPMorgan. These are real dollars and they’re moving fast.
Two things in response, neither of them dismissive.
Devin’s numbers are Cognition’s own and have not been re-validated on an independently administered benchmark since the 2024 SWE-bench controversy, when the demo tasks turned out to be easier than the marketing implied. More importantly, if you read Cognition’s own description of where Devin works best — “4–8 hour tasks with clear upfront requirements and verifiable outcomes” — that is the exact definition of queue-shaped work in coding clothes. A Jira ticket with a clear acceptance test is a queue item. A PM walking over to ask “can you take a look at this?” is not. Cursor’s 35%-of-merged-PRs-from-background-agents stat actually strengthens my point: background agents run against a queue. They aren’t the conversational copilot quadrant. So part of the coding-agent boom is real evidence that my thesis is right, just dressed in IDE clothing.
The other thing is Uber, and this is the number every agent-platform planning team should be staring at. Uber rolled out Claude Code in December 2025, doubled usage by February 2026, and burned the entire 2026 AI budget in four months. Q1 2026 R&D spend hit $951 million, up 17% year over year. About 10% of code committed at Uber is now agent-written. Uber’s COO Andrew Macdonald said the quiet part on stage: “That link is not there yet,” referring to the connection between AI spend and shipped product value. I want to be honest about what this means. The proximate cause was governance, specifically internal leaderboards that ranked teams by tool usage rather than output. So Uber’s burn isn’t a clean failure of the per-task-economics condition. It’s a failure of governance that masked whether the per-task economics were ever holding. That is a more uncomfortable story than “agents don’t pay,” and it’s the story I’d put on the cover. If your incentive system measures token consumption instead of merged value-bearing work, you will reproduce Uber’s $951 million quarter at your own scale, and no model upgrade will save you.
So where does that leave the thesis? Coding agents are scaling. Coding agents are also, in the single largest deployment outside a vendor’s own walls, failing to demonstrate the per-task value link that the rest of the queue-shaped category demonstrates cleanly. I don’t know yet which direction this resolves. What I would watch is whether Cursor’s 35% number holds up under independent audit and whether Cognition’s $492M ARR shows up as durable seat-expansion in the next two quarters, or as a spike that traces back to enterprise pilot budgets that won’t renew at the same level once finance owns the line item.
What I would actually build, and what I wouldn’t
If I were sitting in your Q3 review, here is the bet I’d defend. Find the workflows where work arrives in a queue, the success signal is binary or close, the per-task value to the business is north of $5, and the tool surface fits on an index card. Be ruthlessly specific about it. Not “customer operations” but “L1 disputed-charge triage where the merchant category code, the chargeback reason, and the prior transaction history together determine the outcome 80% of the time.” That’s your AI Jim. Build it. Instrument it on the deterministic success signal from day one. Refuse to ship until the per-step accuracy compounded over the step count (accuracy raised to the power of the number of steps, not multiplied by it) clears 80% end-to-end. Anything customer-facing and conversational stays on a chatbot, priced like Fin and staffed with a human escalation lane that you do not pretend you can amortize away. The Air Canada ruling is not going away.
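That checklist can be written down as a screen to run over a list of candidate workflows. The sketch below is one possible encoding, not a standard. The $5 floor and the 80% gate come from the paragraph above. `MAX_TOOLS` is my stand-in for "fits on an index card," the gate is read as per-step accuracy raised to the power of the step count, and the example workflows and their numbers are invented for illustration.

```python
from dataclasses import dataclass

MIN_VALUE_USD = 5.0    # "north of $5" from the text
MIN_END_TO_END = 0.80  # ship gate from the text
MAX_TOOLS = 5          # hypothetical stand-in for "fits on an index card"


@dataclass
class Workflow:
    name: str
    queue_initiated: bool
    binary_success_signal: bool
    value_per_task_usd: float
    tool_count: int
    per_step_accuracy: float
    steps: int


def failed_checks(w: Workflow) -> list[str]:
    """Return the conditions a workflow fails; an empty list means build it."""
    failures = []
    if not w.queue_initiated:
        failures.append("not queue-initiated")
    if not w.binary_success_signal:
        failures.append("no deterministic success signal")
    if w.value_per_task_usd <= MIN_VALUE_USD:
        failures.append("per-task value too low")
    if w.tool_count > MAX_TOOLS:
        failures.append("tool surface too large")
    # Compounded accuracy, assuming independent errors (the optimistic case).
    if w.per_step_accuracy ** w.steps < MIN_END_TO_END:
        failures.append("end-to-end success below gate")
    return failures


# Illustrative inputs only; none of these numbers come from a real deployment.
dispute_triage = Workflow("L1 disputed-charge triage", True, True, 12.0, 4, 0.97, 6)
support_chat = Workflow("open-ended support chat", False, False, 0.40, 8, 0.95, 20)

for workflow in (dispute_triage, support_chat):
    print(workflow.name, "->", failed_checks(workflow) or "build it")
```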
What I would refuse to build is an “agent” anywhere outside the queue-plus-bounded quadrant: open-ended conversational work, or queue-shaped work without a clean success signal. That is where the 95% of failed pilots in the MIT report live, and no choice of framework, no model upgrade, no prompt-engineering pattern fixes a workflow whose value per step is below its token cost per step.
Sierra is at $15B in May 2026. Decagon at $4.5B in January. Parloa at $3B. Those valuations encode a bet that the queue-plus-bounded category is durably large. The next round of agent-platform valuations gets earned, or doesn’t, by whether the named customer lists (SoFi, Ramp, Brex on Sierra; Avis, Mercado Libre, Deutsche Telekom on Decagon) show up in 2026 and 2027 10-Ks as P&L line items, not as press releases. That is the actual question for 2026. Not whether the chatbot dies. Whether the next $15B of agent-platform value gets earned by ten more Lemonades, or by ten more Ubers.
I know which one I would underwrite. I do not yet know which one the market will.
Originally published at https://suchitmajumdar.substack.com.
Note: Article content contains the views of the contributing authors and not Towards AI.