I Built an App Entirely with Claude Code. This is What I Learnt

Last Updated on May 29, 2026 by Editorial Team

Author(s): Venkat Peri

Originally published on Towards AI.

Across 26 calendar days, I logged 97 active hours and ran 9.9 billion tokens through Claude Code on a single project. The result is an app now in a production pilot, used daily by colleagues in real workflows. I built it on evenings and weekends, working around my day job.

The 97-hour number deserves clarification. Most of it was Claude working autonomously while I watched the chat, dispatched the next thing, or stepped away to watch a video or work on something else with the window visible. My active attention was probably a quarter to a third of that. Claude did the typing. I did the architecture, the dispatching, and the review.

The numbers below come from JSONL session logs and the project’s git history, not estimates.
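As a rough illustration of how token totals can be pulled out of session logs, the sketch below sums usage counters across JSONL lines. The field layout (a `message.usage` object with `input_tokens`, `output_tokens`, and two cache counters) is an assumption about the log schema rather than something documented in this article, so check it against your own files before trusting the totals.

```python
import json
from collections import Counter
from typing import Iterable

# Assumed usage field names; verify against your own session logs.
USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def sum_tokens(lines: Iterable[str]) -> Counter:
    """Sum token counters over JSONL lines, skipping blank or malformed ones."""
    totals: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated or partially written line
        if not isinstance(record, dict):
            continue
        message = record.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue  # records without usage data (summaries, tool results)
        for name in USAGE_FIELDS:
            value = usage.get(name, 0)
            if isinstance(value, int):
                totals[name] += value
    return totals
```

Calling `sum_tokens(open(path, encoding="utf-8"))` on each log file and adding the resulting counters gives a per-project total that can be compared against git history.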

The thesis

What is worth writing about is the texture of how the work happened and what it cost. Bursty short blocks beat long deep-focus sessions. Rework was 6.5% of token spend, and 78% went to code that shipped. The mode and rhythm are replicable. The hour or two of daily attention I gave it is the part that depends on your circumstances.

The mode

A specific way of working showed up early and stuck. One Claude Code desktop chat at a time. Opus 4.7 in High mode (extended thinking). Background tasks only for /ready polling after deploys. That was the entire toolset.
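For readers unfamiliar with readiness polling, here is a minimal sketch of what a /ready check after a deploy might look like. The timeout, interval, and the idea that the endpoint answers HTTP 200 when healthy are all assumptions; the article does not describe how the actual polling was implemented.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,   # assumed budget, tune per deploy
    interval_s: float = 2.0,    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # connection errors and HTTP errors while the service restarts
        if clock() >= deadline:
            return False
        sleep(interval_s)


def http_ready(url: str) -> Callable[[], bool]:
    """Build a probe that treats HTTP 200 from url as ready."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    return probe
```

A call such as `wait_until_ready(http_ready("https://example.com/ready"))` (placeholder URL) blocks until the service answers or the budget runs out. The injected `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.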

The flow: state the problem with concrete evidence (URLs, counts, expected-versus-observed), pick from the options Claude proposed, watch Claude execute. Most of the wall time was Claude working autonomously. I watched the chat when something was about to land, dispatched the next thing when Claude finished, or stepped away to watch a video or work on something else with the window visible.

This is the part of the workflow people get wrong when they read “97 hours” and assume it means 97 hours of intensive coding. It does not. It means 97 hours during which the chat was running work, of which a meaningful fraction was Claude churning through extended autonomous tasks while I was looking at something else. The leverage was not me grinding harder. The leverage was that Claude Code can run long enough on a single dispatch that the user can disengage.

The natural unit was a short block: 33 minutes on average, with a median of 14. Short blocks were quick reads and course-corrects. Long blocks over an hour were feature-shipping work where Claude churned through implementation while I checked in periodically. I did not force focus sessions. The natural rhythm of ship-verify-queue produced 30-minute bursts.
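The gap between a 33-minute average and a 14-minute median is what a skewed distribution looks like: a few long blocks pull the mean up. The sketch below shows one way to derive blocks from event timestamps and compare the two statistics. The 10-minute idle threshold is a hypothetical choice; the article does not say how block boundaries were defined.

```python
from datetime import datetime, timedelta
from statistics import mean, median
from typing import List, Sequence


def split_into_blocks(
    stamps: Sequence[datetime],
    max_gap: timedelta = timedelta(minutes=10),  # assumed idle threshold
) -> List[List[datetime]]:
    """Group timestamps into blocks; a gap longer than max_gap starts a new block."""
    blocks: List[List[datetime]] = []
    for ts in sorted(stamps):
        if blocks and ts - blocks[-1][-1] <= max_gap:
            blocks[-1].append(ts)
        else:
            blocks.append([ts])
    return blocks


def block_minutes(block: Sequence[datetime]) -> float:
    """Duration of one block in minutes, first event to last event."""
    return (block[-1] - block[0]).total_seconds() / 60


# Illustrative timestamps only, not data from the project.
base = datetime(2026, 5, 8, 9, 0)
offsets = [0, 5, 8, 60, 62, 240, 249, 258, 267, 276, 285, 294, 303]
blocks = split_into_blocks([base + timedelta(minutes=m) for m in offsets])
durations = [block_minutes(b) for b in blocks]
print(durations, mean(durations), median(durations))
```

With two short blocks and one long one, the mean lands well above the median, which is the same shape the project data shows.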

Peak hour was 5pm Eastern, immediately after primary-job work ended. A second cluster sits at 7–8am before primary work started. The fit-in-the-cracks part worked because I was not actively engaged for most of the wall time. Claude was.

The leverage came from prompting discipline and Claude Code’s ability to run autonomously for minutes at a stretch. The toolset was deliberately simple.

The texture

Classification of 179 work blocks into seven activity categories, by token spend:

Feature work: 69.8% (6.9B tokens), 72 blocks
Bugfix: 10.4% (1.0B), 21 blocks
Deploy and verification glue: 9.8% (966M), 25 blocks
Rework: 6.5% (648M), 6 blocks
Review and exploration: 3.0% (299M), 42 blocks
Refactor: 0.4% (43M), 2 blocks
Other: ~0%, 11 blocks

78% of token spend went to shipping work (feature plus bugfix plus refactor). Deploy glue is necessary overhead. Review blocks are exploration, fast and cheap. Rework is rare and heavy when it happens: average 108M tokens per block, the highest of any category.
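The shares and per-block averages can be recomputed from the list above. The sketch below uses the rounded token counts exactly as listed, treating "Other" as zero tokens. With those rounded inputs the shipping share comes out near 80%, a little above the 78% quoted in the prose; the per-block data behind the article is not reproduced here, so treat that difference as unresolved rather than as a correction.

```python
# (tokens, blocks) per category, copied from the rounded figures in the list.
CATEGORIES = {
    "feature": (6_900_000_000, 72),
    "bugfix": (1_000_000_000, 21),
    "deploy_glue": (966_000_000, 25),
    "rework": (648_000_000, 6),
    "review": (299_000_000, 42),
    "refactor": (43_000_000, 2),
    "other": (0, 11),  # listed as ~0%, treated as zero here
}
SHIPPING = ("feature", "bugfix", "refactor")

total_tokens = sum(tokens for tokens, _ in CATEGORIES.values())
total_blocks = sum(blocks for _, blocks in CATEGORIES.values())


def share(names) -> float:
    """Fraction of total token spend attributed to the given categories."""
    return sum(CATEGORIES[name][0] for name in names) / total_tokens


def tokens_per_block(name: str) -> float:
    """Average tokens per work block for one category."""
    tokens, blocks = CATEGORIES[name]
    return tokens / blocks


shipping_share = share(SHIPPING)
heaviest = max(CATEGORIES, key=tokens_per_block)
print(f"{total_blocks} blocks, shipping share {shipping_share:.1%}, heaviest: {heaviest}")
```

Rework averages 648M / 6 = 108M tokens per block, ahead of feature work at roughly 96M, which matches the "rare and heavy" description.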

The bimodal pattern matters for anyone trying to copy the mode: many short blocks (a 5–15 minute exploration and verification cluster), a smaller number of long blocks (a 1–5 hour feature-shipping cluster), and almost nothing in between. The rhythm is tight bursts of agentic execution between verification gates. Even the long blocks were Claude executing while I checked in periodically; long uninterrupted focus sessions on my side did not appear in the data.

The marathon weekend (Friday May 8 through Sunday May 10) shipped what is probably 4–6 weeks of normal work: 25.4 hours of chat-active wall time, around 1 billion tokens, 12 merged PRs covering the multi-tenant auth foundation, the workspace and invite system, and a full multi-session refactor across two design-coupled PRs. The compression factor came from sustained dispatching over a window where the architecture was clear and Claude could run continuously.

Three rework stories

The 6.5% rework number deserves three specific stories. “AI gets it wrong half the time” is the most common objection to this kind of work, and the data tells a different story.

The architectural pivot

The app’s post-meeting pipeline started as two sequential Sonnet calls: a validation pass that cleaned highlights, then a notes generation pass that synthesized them. Validation took 30–60 seconds, notes another 10–30. Users clicking notes during that window saw a loading screen. After several local cleanups (small mistakes of the “let me clean this up” kind), I pushed back on the design itself: this validation is killing us, shouldn’t it run atomically with the notes generation?

Claude responded with three options and the tradeoffs of each: fire-and-forget validation, drop validation entirely, or inline validation into the notes prompt as a JSON trailer. I picked the third. Claude refactored the pipeline to one Sonnet call instead of two, retired the separate validation module, and removed the 120-second polling loop the streaming endpoint had been waiting on.
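To make the third option concrete, here is a minimal sketch of parsing a single model response that carries the notes followed by a JSON validation trailer. The delimiter string and the trailer keys are hypothetical; the article does not show the real prompt or response format.

```python
import json
from typing import Any, Dict, Tuple

TRAILER_MARKER = "---VALIDATION---"  # hypothetical delimiter requested in the prompt


def split_notes_and_trailer(response_text: str) -> Tuple[str, Dict[str, Any]]:
    """Split one response into (notes, validation dict).

    Falls back to an empty dict when the trailer is missing or not valid
    JSON, so the notes can still be shown to the user.
    """
    notes, marker, trailer = response_text.rpartition(TRAILER_MARKER)
    if not marker:
        return response_text.strip(), {}
    try:
        parsed = json.loads(trailer)
    except json.JSONDecodeError:
        return notes.strip(), {}
    return notes.strip(), parsed if isinstance(parsed, dict) else {}
```

The design point is that one call returns both artifacts, so there is no second request to wait on and no polling loop between the two stages. The cost is that the parser has to tolerate a missing or malformed trailer.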

The story here: AI laid out the tradeoff space, I made the call, AI executed the refactor. The judgment about which tradeoff to take was mine. The architectural decomposition was the AI’s.

The wrong diagnosis

Cross-pod 404s on historical session views. Claude diagnosed: pods have separate filesystems, so the file lookup hit the wrong pod. Shipped a Redis snapshot fallback and a boot-time seeder. Three commits.

I asked offhand: I thought both filesystems were mounted from the same EBS volume? Claude verified, came back, and said directly: you’re right, both pods share the EBS bind mount, and my separate-filesystems diagnosis was wrong. The mistake traced back to a shell-glob expansion issue in an earlier docker exec ls, where the glob expanded on my local Mac, not in the container.

The over-engineered fix stayed in because it is harmless redundancy. The actual cost was 15 minutes of wasted reasoning.

This is the most quotable rework moment in the data. Real engineering mistake (debugging through a local shell, misreading the deployment topology). Real recovery (verify, admit, keep the harmless work, name the actual cost). What makes it richer: I caught it with a five-word “oh wait” prompt.
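The glob pitfall is easy to reproduce: an unquoted wildcard typed into a local shell is expanded against the local filesystem before docker ever sees it. One way to avoid it, sketched below, is to hand the pattern to a shell inside the container. The container name and path are placeholders, and the sketch assumes the image ships `sh`.

```python
import shlex
from typing import List


def ls_in_container(container: str, pattern: str) -> List[str]:
    """Build an argv whose glob is expanded by a shell inside the container.

    The pattern travels as part of the `sh -c` string, so no local shell
    gets a chance to expand it first. Patterns containing spaces are not
    handled by this sketch.
    """
    return ["docker", "exec", container, "sh", "-c", f"ls {pattern}"]


argv = ls_in_container("app-1", "/data/sessions/*.json")  # placeholder names
print(shlex.join(argv))
# prints: docker exec app-1 sh -c 'ls /data/sessions/*.json'
```

Passing `argv` to `subprocess.run` without `shell=True` keeps the wildcard literal on the local side. The printed form shows the quoting needed when typing the same command by hand.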

The clean revert

Shipped a new output_media.camera payload to give the meeting bot a custom avatar. Production rejected it with a 400, blocking new meeting joins. Claude's response: the new payload made the API 400, probably wrong shape or unsupported on our plan tier, let me revert immediately and add response-body logging so future failures are diagnosable. Reverted in two minutes. Kept the diagnostic logging since it is useful for any future API failure. Textbook tried-something, didn't-work, backed-out-without-drama.
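The diagnostic that survived the revert, logging the response body on failure, is a small pattern worth showing. The sketch below wraps any request function that returns a status code and body. The names `checked_call` and `ApiError` and the 2000-character cap are illustrative assumptions, not the project's actual code.

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger("meeting_bot")  # hypothetical logger name


class ApiError(RuntimeError):
    """Raised when the upstream API answers with an error status."""


def checked_call(send: Callable[[], Tuple[int, str]], what: str) -> str:
    """Run send(), log the response body on HTTP errors, and return the body on success.

    send is any zero-argument callable returning (status_code, body_text).
    """
    status, body = send()
    if status >= 400:
        # The body usually names the rejected field; cap it to keep logs readable.
        logger.error("%s failed: HTTP %d, body=%s", what, status, body[:2000])
        raise ApiError(f"{what} failed with HTTP {status}")
    return body
```

With this in place, a 400 caused by a wrong payload shape or an unsupported plan feature leaves the API's own explanation in the logs instead of only a status code.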

What the three have in common

Real engineering coordination across different scales. Architectural pivots where the AI proposes the option space and you decide. Honest mistake-and-recovery where the AI admits a wrong diagnosis after you push back. Clean reverts where production rejected a change and the AI rolled it back without ego.

What changed about my role

The shift was from writing code to making architectural decisions in concrete prompts. The work that mattered was stating the problem clearly, picking among options the AI proposed, and steering toward the right path when it started drifting.

I learned to spot the wrong-path pattern early. Same file edited five times in fifteen minutes, repeated grep checks against the same module, an assistant response that hedges across two approaches: those are signals to interrupt and re-scope. Catching this at five minutes saves 90 minutes downstream.
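The first of those signals can be checked mechanically. The sketch below flags any file edited five or more times within a fifteen-minute window, given a list of (timestamp, path) edit events. Where those events come from (for example, tool-use records in a session log) is left as an assumption, and the thresholds simply mirror the numbers in the paragraph above.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Deque, Dict, Iterable, Set, Tuple


def churning_files(
    edits: Iterable[Tuple[datetime, str]],
    max_edits: int = 5,
    window: timedelta = timedelta(minutes=15),
) -> Set[str]:
    """Return paths edited at least max_edits times within any sliding window."""
    recent: Dict[str, Deque[datetime]] = defaultdict(deque)
    flagged: Set[str] = set()
    for ts, path in sorted(edits):
        stamps = recent[path]
        stamps.append(ts)
        # Drop edits that have fallen out of the window ending at ts.
        while ts - stamps[0] > window:
            stamps.popleft()
        if len(stamps) >= max_edits:
            flagged.add(path)
    return flagged
```

A non-empty result is a prompt to stop and re-scope, not proof that the path is wrong; legitimate iterative work on one file can trip it too.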

The second habit was feeding Claude the diff between expected and observed. “Summarize today returned one meeting but there were three” works. “Notes are broken” does not. The bug-report-as-Jira-ticket muscle is the lever. Across the project, my best prompts read like incident reports with URLs, counts, and expected-versus-actual diffs.
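One way to make the expected-versus-observed habit stick is to force every report through the same small structure. The sketch below is an illustrative template, not the format used on the project; the field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BugReport:
    """A prompt skeleton that always states expected versus observed."""
    action: str
    expected: str
    observed: str
    evidence: List[str] = field(default_factory=list)  # URLs, counts, log lines

    def to_prompt(self) -> str:
        lines = [
            f"Action: {self.action}",
            f"Expected: {self.expected}",
            f"Observed: {self.observed}",
        ]
        if self.evidence:
            lines.append("Evidence:")
            lines.extend(f"- {item}" for item in self.evidence)
        return "\n".join(lines)


report = BugReport(
    action="Summarize today",
    expected="3 meetings summarized",
    observed="1 meeting returned",
    evidence=["count in calendar: 3"],
)
print(report.to_prompt())
```

The structure makes "Notes are broken" impossible to file: there is no way to fill in the expected and observed fields without saying what the difference is.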

The third was reading PR descriptions critically. Claude writes detailed PRs by default, including a “What’s NOT in this PR” section that names the deferred work. That section is often where the next bug lives. Reading it carefully, including the parts where Claude says “I’d hold on Phase 3c until there’s a concrete need,” kept the architecture from accumulating obvious follow-up debt.

Two surprises

Bursty short blocks beat long deep-focus sessions. The average block is 33 minutes, median 14. The mode that produced the highest output was short dispatches with checks between them. For comparison, the marathon weekend packed about a quarter of the project’s active hours (25.4 of 97) into three days, and the per-block average inside that window was still around 30 minutes. The pace was sustained, the unit was small, and my own active engagement was a fraction of the wall time.

Average block duration by activity falls into two tiers: shipping work at 30–50 minutes, and exploration and quick fixes at 3–10 minutes.

The rework number was 6.5%, well below what I expected. Going in, I assumed I would spend significant time fixing what the AI got wrong. The data says 6.5% of tokens went to rework blocks. By contrast, 78% went to code that shipped. The “AI hallucinates code” objection does not survive contact with the JSONL.

What I gave up

This is the section I am least sure about, and the one readers should engage with most carefully.

Code I did not read, at all. If I have to debug a layout regression six months from now, my ramp-up cost will be higher than if I had typed every line myself.

A specific architecture decision I would have made differently. The session model went through three phases of refactoring: PR A (cookie auth, single-session preserved), PR B (multi-session refactor), then Phase 3b (live writes during the meeting). Each phase was correct given the constraints visible at that step. Looking back, I might have started with the multi-tenant assumption from the beginning rather than retrofitting it. Claude was right at every step; my mental model was incomplete at the early steps. Whether that is a Claude problem or a me problem is unclear.

The team-of-two question. The mode I used assumes a single driver. Sharing a Claude Code project across two engineers introduces a context-sharing problem I have not solved. Worktrees help, but the architectural memory lives in conversation history, and conversation history does not parallelize to a second person without effort. I do not know whether this mode scales past one driver.

The maintenance-cost question. If Claude were unavailable starting tomorrow, the cost of debugging and extending this codebase would be meaningfully higher than if I had built it the traditional way. The bet implicit in this kind of work is that the tool will continue to exist and improve. That is a real assumption to make explicit.

The bar

Worth stating explicitly because the default assumption is “everyone can do this now.”

The replicable parts: one Claude Code chat at a time with Opus High, background tasks for deploys and verification, the 33-minute block rhythm, the pre-work and 5pm timing windows, and the rigorous bug-report-as-Jira-ticket prompting. Any senior engineer with a clear sense of what they want can adopt these tomorrow.

The non-replicable parts: an hour or two of focused dispatching and review per day on top of a primary job, the architectural judgment to spot a wrong path at five minutes, the discipline to push back when the AI proposed a design I disagreed with. The bar is judgment plus the time to apply it. Access to Claude Code is the cheapest part.

For someone earlier in their career, the bar is different. Without the architectural taste to steer with, the same tool produces less coherent output, more rework, and a codebase that is hard to maintain. The right path is probably to use Claude Code as a learning amplifier: read its PRs as design documents, study the tradeoffs it names. That earns the taste needed before treating it as a velocity multiplier.

On greenfield

This was a greenfield project: I was building something new and owned the architecture from the first commit. The numbers in this piece come from that context.

How much of this transfers to brownfield work (an existing codebase, established patterns, tech debt, teammates) is hard to answer from this data alone. The mode probably transfers. So does the bug-report-as-Jira-ticket prompting and the PR-description-as-design-document discipline. The 4-to-6-weeks-in-a-weekend compression factor probably does not. Greenfield gives you architectural clarity that lets execution parallelize. Brownfield work asks both you and Claude to read and pattern-match before writing, which changes the math on every dimension of this article: more tokens per block, more verification overhead, less of the compression that greenfield’s clarity enables.

The honest version of this article would be a brownfield follow-up six months from now. I do not have that data yet.

What I would change next time

I would start with the multi-tenant assumption from day one rather than retrofitting workspaces into a single-tenant model. The single biggest source of rework in this project was that retrofit.

A CLAUDE.md project guide should land at the start of the project. Mine landed at week three. The guide for this project covers RLS posture, deploy mechanics, session lifecycle, and known broken patterns. Every project of this kind benefits from a written-down version of what would otherwise be conversation context, accessible to every new session.

Investing in evals earlier would have caught several bugs that surfaced in production. Most bugs in this project were infrastructure shape (stale Redis snapshots, audience-cache key mismatches, validation state propagation) rather than LLM output quality. A pytest harness with workspace and session fixtures running end-to-end checks on every deploy would have caught several of them.
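A minimal sketch of what such end-to-end checks might look like follows. It runs against an in-memory stand-in, since the real store and API are not shown in this article; `FakeStore`, its methods, and the workspace IDs are all hypothetical. To keep the sketch dependency-free, the setup is a plain factory function. In a real pytest harness it would more likely be a fixture pointing at the deployed environment.

```python
import itertools
from typing import Dict, Optional, Tuple


class FakeStore:
    """In-memory stand-in for the real session store (hypothetical API)."""

    def __init__(self) -> None:
        self._sessions: Dict[Tuple[str, str], dict] = {}
        self._ids = itertools.count(1)

    def create_session(self, workspace_id: str, title: str) -> str:
        session_id = f"s{next(self._ids)}"
        # Keying on (workspace, session) keeps tenants apart by construction.
        self._sessions[(workspace_id, session_id)] = {"title": title}
        return session_id

    def get_session(self, workspace_id: str, session_id: str) -> Optional[dict]:
        return self._sessions.get((workspace_id, session_id))


def make_store() -> FakeStore:
    """Factory standing in for a pytest fixture."""
    return FakeStore()


def test_session_readable_in_own_workspace() -> None:
    store = make_store()
    sid = store.create_session("ws-a", "standup")
    assert store.get_session("ws-a", sid) == {"title": "standup"}


def test_session_hidden_from_other_workspace() -> None:
    store = make_store()
    sid = store.create_session("ws-a", "standup")
    assert store.get_session("ws-b", sid) is None
```

Saved in a file named like `test_sessions.py`, these functions would be collected by pytest as they are. The value comes from pointing the same assertions at a freshly deployed environment on every release.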

The thing I would not change: the mode. One chat, Opus High, 33-minute bursts, evenings and pre-work, dispatch and review. That cadence ran for 97 hours over 26 calendar days while I held down a primary job. The unit was small, the pace was sustained, the discipline was that Claude shipped the code and I shipped the architecture.


Published via Towards AI
