We Replaced ChatGPT With a Local AI Server. Six Months of Honest Data.
Last Updated on June 18, 2026 by Editorial Team
Author(s): Services Ground
Originally published on Towards AI.

This is not a “local AI is better” argument.
It is a data argument.
Six months ago, a number stopped me mid-scroll: Qwen 2.5 Coder 32B scored 92.9 on HumanEval. GPT-4o scored 90.2.
HumanEval is a widely used coding benchmark: 164 hand-written Python programming problems, each scored by whether the generated code passes its unit tests. It is not perfect, but it is the closest thing to an objective apples-to-apples comparison the field has.
A free, open-source model running on consumer hardware had just outperformed the model our team was paying $30 per user per month for. On the benchmark that matters most for our use case.
That number demanded an honest audit of what we were actually paying for.
What followed was six months of running both systems in parallel, tracking outputs against real tasks, measuring costs, and documenting the surprises. This article is that documentation — with the honest failures alongside the wins.
The Audit: What We Were Actually Paying For
Before building anything, we mapped every AI task our team of ten performed in a typical week.
The breakdown was more lopsided than expected:
- ~45% writing tasks — emails, documentation, summaries, proposals
- ~30% coding tasks — debugging, code review, function generation, test writing
- ~15% analysis tasks — data interpretation, structured reasoning, research synthesis
- ~10% edge cases — tasks requiring real-time information, highly specialized reasoning, or frontier-level capability
The critical insight from this audit: only about 10% of tasks genuinely required frontier-level intelligence, yet the flat per-user-per-month price covered all of them, including the 90% where a local 14B model would produce output we couldn’t reliably distinguish from GPT-4o.
This is the framing that matters. The question was never “is local AI better?” It was “for the specific distribution of tasks our team performs, does the quality delta justify the cost delta?”
The honest answer for our team: no. Not at $300/month scaling indefinitely with headcount.
The Hardware Decision
We selected an RTX 3090 with 24GB of VRAM, purchased used for $600.
The 24GB threshold is the critical inflection point in the local AI hardware tier because it is the minimum required to run 32B parameter models with Q4 quantization. Below 24GB you are running 14B models, which are capable but noticeably weaker on complex multi-step tasks.
The full hardware VRAM tier picture:
| Hardware | VRAM | Max Model (Q4) | Quality Tier |
| --- | --- | --- | --- |
| CPU only | 16–64GB RAM | 7B (3–8 tok/s) | Acceptable for simple tasks |
| RTX 3070 / 4060 Ti | 8GB | 7B–8B | Good for daily tasks |
| RTX 3080 / 4080 | 16GB | 13B–14B | Strong, near-frontier on most tasks |
| RTX 3090 / 4090 ✅ | 24GB | 32B–34B | Competitive with GPT-4o on benchmarks |
| Dual 3090 / A6000 | 48GB+ | 70B full | Frontier-adjacent |
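The tiers above follow from a rough rule of thumb: weight memory is approximately parameter count times bits per weight, plus an allowance for the KV cache and runtime buffers. The sketch below is a back-of-envelope estimate, not a measurement. The 4.5 bits-per-weight figure and the 1.5GB overhead are assumptions chosen for illustration, and real usage varies with the quantization variant and context length.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.5,    # assumed effective size of a Q4 quant
                     overhead_gb: float = 1.5) -> float:  # assumed KV cache + runtime buffers
    """Rough VRAM estimate for a quantized model, in GB (1 GB = 1e9 bytes)."""
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb + overhead_gb


def fits(params_billion: float, vram_gb: float) -> bool:
    """True if the estimated footprint fits within the card's VRAM."""
    return estimate_vram_gb(params_billion) <= vram_gb


for size in (7, 14, 32, 70):
    print(f"{size}B at Q4: ~{estimate_vram_gb(size):.1f} GB")
```

Under these assumptions a 32B model lands just under 20GB, which is why 24GB is the first consumer tier where it fits with any headroom.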
Total infrastructure cost: ~$1,200 including the GPU, a used workstation, and 2TB NVMe storage. Break-even against our previous ChatGPT Team subscription: four months.
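The break-even figure is simple arithmetic, sketched below so it can be rerun with other numbers. The $1,200 and $300 values come from this article. The optional running-cost argument is an illustrative extension, and the $30 per month used in the second call is derived from the $360 per year that the cost table later in the article lists for electricity and API use.

```python
import math


def break_even_months(hardware_cost: float,
                      monthly_subscription: float,
                      monthly_running_cost: float = 0.0) -> int:
    """Months until cumulative subscription savings cover the hardware cost."""
    monthly_saving = monthly_subscription - monthly_running_cost
    if monthly_saving <= 0:
        raise ValueError("running cost must be below the subscription cost")
    return math.ceil(hardware_cost / monthly_saving)


# Hardware only, as quoted in the article.
print(break_even_months(1200, 300))
# Same calculation with an assumed $30/month for electricity and API fallback.
print(break_even_months(1200, 300, monthly_running_cost=30))
```

Including running costs pushes the break-even out by about a month, which does not change the conclusion.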
The Model Stack
We ran every major open-source model against our actual task distribution before settling on the final stack. Here is what we landed on and why each choice was made.
General Tasks — Qwen 2.5 14B
Pull command:
ollama pull qwen2.5:14b
Handles writing, email drafting, summarization, analysis, and Q&A. Fits in 9GB VRAM with Q4 quantization, leaving 15GB headroom for other processes or concurrent requests.
The quality surprise: on writing tasks — the category where we expected the largest gap — we could not reliably distinguish Qwen 2.5 14B output from GPT-4o output in blind testing. The model’s instruction following is strong, tone control is accurate, and output length calibration is consistent.
This is the default model. Most daily queries never need anything larger.
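Once the model is pulled, scripts can call it through Ollama's local HTTP API. The sketch below is a minimal example that assumes an Ollama server is running on its default port (11434) on the same machine. The helper names `build_payload` and `generate` are made up for this example and are not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(prompt: str, model: str = "qwen2.5:14b") -> dict:
    """Request body for a single non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt: str, model: str = "qwen2.5:14b", url: str = OLLAMA_URL) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("Summarize this email in two sentences: ...")` returns the model's reply as a plain string.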
Coding Tasks — Qwen 2.5 Coder 32B
Pull command:
ollama pull qwen2.5-coder:32b
The benchmark data holds in production. This model handles Python, TypeScript, Go, Rust, SQL, and shell scripting with genuine competence — idiomatic output, correct function signatures, accurate debugging explanations.
It uses ~20GB VRAM at Q4, leaving minimal headroom on a 24GB card. This means it does not run simultaneously with other large models — Ollama swaps it in on demand and evicts the previous model. The swap latency is 3–5 seconds on NVMe storage. Acceptable for a team that isn’t running multiple models simultaneously.
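A quick way to see why swapping is unavoidable is to add up the footprints. The sketch below uses only the approximate Q4 figures quoted in this article and ignores KV cache growth, so it is optimistic. It does not model how Ollama actually decides which model to evict.

```python
from itertools import combinations

# Approximate Q4 footprints quoted in this article, in GB.
FOOTPRINT_GB = {
    "qwen2.5:14b": 9.0,
    "qwen2.5-coder:32b": 20.0,
    "nomic-embed-text": 0.3,
}


def can_coexist(models, vram_gb: float = 24.0) -> bool:
    """True if the listed models fit in VRAM at the same time."""
    return sum(FOOTPRINT_GB[name] for name in models) <= vram_gb


for pair in combinations(FOOTPRINT_GB, 2):
    status = "fits" if can_coexist(pair) else "needs a swap"
    print(f"{pair[0]} + {pair[1]}: {status}")
```

The coder model can share the card with the embedding model, but not with the 14B general model.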
HumanEval comparison for context:
| Model | HumanEval | VRAM (Q4) | Cost |
| --- | --- | --- | --- |
| Qwen 2.5 Coder 32B | 92.9 | 20GB | Free |
| GPT-4o | 90.2 | n/a | $20+/mo |
| DeepSeek Coder V2 Lite | 90.2 | 10GB | Free |
| Qwen 2.5 Coder 7B | 83.5 | 5GB | Free |
Reasoning Tasks — DeepSeek R1 14B
Pull command:
ollama pull deepseek-r1:14b
DeepSeek R1 is trained to write out its chain-of-thought reasoning before committing to an answer. The visible reasoning trace is not cosmetic: it produces measurably more accurate results on multi-step analytical tasks compared to standard instruction-following models of the same size.
The tradeoff is speed. R1 generates its reasoning chain before producing a final answer, which adds latency. For tasks where accuracy matters more than speed — structured analysis, complex data interpretation, multi-constraint planning — it is the correct tool. For quick tasks, Qwen 2.5 7B is faster.
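When R1 output feeds another tool, it usually helps to separate the reasoning trace from the final answer. The sketch below assumes the raw output wraps the reasoning in `<think>...</think>` tags, as R1-family models commonly do. Some server versions may return the reasoning in a separate field instead, in which case this step is unnecessary. `split_reasoning` is a hypothetical helper name.

```python
import re

# Matches the reasoning block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output."""
    match = THINK_BLOCK.search(raw)
    if match is None:
        # No reasoning block found: treat the whole output as the answer.
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, answer


sample = "<think>12 items at $3 each is $36.</think>\nThe total is $36."
reasoning, answer = split_reasoning(sample)
print(answer)
```

Keeping the trace around rather than discarding it is useful for review, since it shows where a wrong answer went wrong.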
Voice Pipeline
Speech-to-Text:
pip install faster-whisper
# Or via Ollama, if its model library offers a Whisper build for your version:
ollama pull whisper
Whisper Large v3 Turbo achieves under 3% word error rate on clean audio — the same quality tier as OpenAI’s paid Whisper API. It runs on 6GB VRAM for real-time processing or CPU for batch transcription. The paid API costs per minute. The local version costs nothing per minute after hardware.
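A minimal transcription sketch with faster-whisper follows. It assumes a CUDA GPU and a recent faster-whisper release that recognizes the `large-v3-turbo` model name. On a CPU-only machine, `device="cpu"` with `compute_type="int8"` is the usual substitute. The function names are illustrative.

```python
def join_segments(segments) -> str:
    """Concatenate segment texts into one transcript string."""
    return " ".join(segment.text.strip() for segment in segments)


def transcribe(path: str, model_size: str = "large-v3-turbo") -> str:
    """Transcribe an audio file with faster-whisper and return the full text."""
    # Imported lazily so this module loads even where faster-whisper is not installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    # transcribe() returns a lazy generator of segments plus metadata about the audio.
    segments, _info = model.transcribe(path, beam_size=5)
    return join_segments(segments)
```

Calling `transcribe("meeting.wav")` downloads the model on first use and returns the transcript as one string. The file name is a placeholder.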
Text-to-Speech:
pip install kokoro
Kokoro (82M parameters) runs entirely on CPU. It produces natural-sounding speech that reviewers consistently rate above models ten times its size, with under 200ms time-to-first-audio on modern hardware. The GPU stays fully allocated to the LLM layer — Kokoro consumes no VRAM.
Document Q&A — RAG with nomic-embed-text
Pull command:
ollama pull nomic-embed-text
nomic-embed-text is the embedding model that enables RAG — Retrieval Augmented Generation. It converts documents into searchable vector representations stored in Qdrant, enabling the AI to retrieve relevant content from your knowledge base before generating responses.
At 0.3GB VRAM it runs alongside any other model without meaningful resource impact. Every team server should have this pulled.
The quality difference RAG makes is not marginal. A local 14B model answering questions against your actual product documentation, meeting notes, and project files produces more accurate business-specific answers than GPT-4o answering cold — because context dominates model quality on domain-specific queries.
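The retrieval step can be sketched in a few lines. This version keeps vectors in an in-memory list as a stand-in for Qdrant, so it shows the technique rather than the production setup described above. It assumes an Ollama server on the default port for the `embed` call. `top_k` and `build_prompt` are hypothetical helper names.

```python
import json
import math
import urllib.request


def embed(text: str, model: str = "nomic-embed-text",
          url: str = "http://localhost:11434/api/embeddings") -> list[float]:
    """Fetch an embedding vector from a running Ollama server."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["embedding"]


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k: int = 3) -> list[str]:
    """indexed is a list of (chunk_text, vector); return the k most similar chunks."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Put the retrieved chunks in front of the question."""
    context = "\n\n".join(chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```

In use, each document chunk is embedded once and stored. At query time the question is embedded, `top_k` selects the closest chunks, and `build_prompt` produces the text sent to the chat model.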
The Interface
docker run -d \
  --name open-webui \
  --restart always \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
Open WebUI provides individual accounts, conversation history, document upload, model switching, and voice input through a browser interface that closely resembles ChatGPT. Team members access it from any device. No installation on their machines. Nobody noticed the switch.
The Three Surprises
Six months of production use produced three findings that were not predictable from benchmarks alone.
Surprise 1 — The Quality Gap Is in the Wrong Place
The assumption going in: local models would struggle most on writing — nuanced tone, creative tasks, complex editing.
The reality: Qwen 2.5 14B handles writing at a level we cannot reliably distinguish from GPT-4o on the majority of business content. In blind output comparisons across 40 writing tasks, team members correctly identified which output came from the local model at slightly above chance rates — not statistically significant.
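Whether a blind-test result beats guessing can be checked with an exact binomial test. The sketch below uses only the standard library. The count of 23 correct identifications is hypothetical: the article reports "slightly above chance" across 40 tasks but does not give the exact number.

```python
from math import comb


def p_value_above_chance(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= correct) under pure guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical count for illustration only; the real figure is not published here.
p = p_value_above_chance(23, 40)
print(f"p = {p:.3f}")
```

A p-value well above 0.05 means the result is consistent with reviewers guessing, which is what "not statistically significant" refers to.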
Where the gap is real: tasks requiring knowledge more recent than the model’s training cutoff. Local models have no web access by default. For current events, recent API documentation, and live data queries, local models fail.
Our solution: a web search MCP server for research tasks, and a cheap API fallback (DeepSeek V3 at $0.27 per million input tokens) for tasks that genuinely need frontier reasoning. Total external AI spend dropped from $300/month to $22 last month across ten people.
Surprise 2 — The Routing Problem Is the Real Engineering Work
Setup time for the full stack — Ollama, Open WebUI, pulling models, configuring RAG, installing Tailscale for remote access — was one afternoon for someone comfortable with a terminal.
The engineering work that took two weeks: routing. Deciding which tasks go local, which go to API fallback, and making that decision invisible to team members.
The routing matrix we landed on:
Quick tasks, emails, summaries → Qwen 2.5 7B (local, free, fast)
Complex writing, analysis → Qwen 2.5 14B (local, free, quality)
All coding → Qwen 2.5 Coder 32B (local, free, best)
Multi-step reasoning → DeepSeek R1 14B (local, free, accurate)
Agentic workflows → Qwen 2.5 32B (local, free, tool use)
Current info / hard edge cases → DeepSeek V3 API ($0.27/M tokens)
This is implemented as model options in Open WebUI. The default is Qwen 2.5 7B. The dropdown includes all local models and one API fallback labeled “Best Quality (API)”. Most team members use the API fallback a handful of times per week.
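The same matrix can be written as a lookup table for scripts that call the models directly. This is a hypothetical programmatic version: the team's routing described above is a manual dropdown choice. The task category names and the `deepseek-chat` API model identifier are assumptions for illustration.

```python
# Task category -> model tag. Category names are invented for this sketch.
ROUTES = {
    "quick": "qwen2.5:7b",
    "writing": "qwen2.5:14b",
    "analysis": "qwen2.5:14b",
    "coding": "qwen2.5-coder:32b",
    "reasoning": "deepseek-r1:14b",
    "agentic": "qwen2.5:32b",
    "current_info": "deepseek-chat",  # assumed API model name for the DeepSeek V3 fallback
}
DEFAULT_MODEL = "qwen2.5:7b"
API_MODELS = {"deepseek-chat"}


def route(task_type: str) -> tuple[str, bool]:
    """Return (model, uses_paid_api) for a task category, falling back to the default."""
    model = ROUTES.get(task_type, DEFAULT_MODEL)
    return model, model in API_MODELS


print(route("coding"))
print(route("current_info"))
```

Returning the paid-API flag alongside the model name makes it easy to log how often the fallback is used.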
Surprise 3 — Model Matching Matters More Than Model Quality
The single largest quality improvement in our setup came not from better hardware or larger models but from task-model matching.
Running the wrong model for a task does not produce obviously bad output — it produces plausible output that is subtly wrong at the same speed and confidence as correct output. A general model handling a complex algorithm design task produces reasonable-looking code that fails edge cases. A coding model handling a strategic analysis task produces structured output that misses the nuance.
The mental model that corrected this: models are tools. You do not use a general-purpose tool for every task when specialized tools are available and cost the same.
After implementing explicit model routing, output quality on coding tasks improved measurably — fewer iterations, fewer bugs caught in review. Not because the model changed but because the right model was being used.
The 20% Where Local Falls Short
Intellectual honesty requires being specific about the failure cases.
Real-time information. Local models have training cutoffs. For tasks requiring current market data, recent technical documentation, or live information, web search via MCP or API routing is required.
Highest-complexity reasoning. On genuinely hard problems — novel algorithm design, complex multi-domain research synthesis, tasks where a wrong answer has significant consequences — GPT-5 class models produce noticeably better output. This represents a small fraction of our actual workload but it exists.
Experimental capabilities. When team members want to test the newest model features — multimodal reasoning, extended thinking, latest API capabilities — the frontier providers have them first.
These three categories represent approximately 15–20% of our team’s AI usage, somewhat more than the ~10% the initial audit classed as edge cases. We pay for them selectively at per-token rates that are trivial against what we were spending on flat subscriptions.
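To put the per-token rate in perspective, the sketch below converts a dollar budget into input tokens at the $0.27 per million figure quoted earlier in this article. It ignores output-token pricing, which is billed separately and is not given here, so the token counts are upper bounds.

```python
INPUT_PRICE_PER_MILLION = 0.27  # USD, the DeepSeek V3 input rate quoted in this article


def input_cost_usd(tokens: int) -> float:
    """Cost of sending this many input tokens."""
    return tokens / 1_000_000 * INPUT_PRICE_PER_MILLION


def input_tokens_for_budget(budget_usd: float) -> int:
    """How many input tokens a budget buys, ignoring output tokens."""
    return int(budget_usd / INPUT_PRICE_PER_MILLION * 1_000_000)


for budget in (22, 300):
    print(f"${budget} buys about {input_tokens_for_budget(budget):,} input tokens")
```

Even the $22 month corresponds to tens of millions of input tokens, far more than a handful of fallback queries per person per week would consume.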

The Honest Cost Model
For a team of ten, three-year comparison:
| Option | Year 1 | Year 2 | Year 3 | Total |
| --- | --- | --- | --- | --- |
| ChatGPT Team | $3,600 | $3,600 | $3,600 | $10,800 |
| Local server | $1,920* | $360 | $360 | $2,640 |
*Hardware $1,200 + electricity/API $720
Savings over three years: $8,160 for a ten-person team.
The savings compound as the team grows. A twenty-person team would pay $7,200/year for ChatGPT Team. The local server cost does not change with headcount — the same hardware serves five people or fifty (with appropriate concurrency configuration).
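The comparison can be reproduced and extended to other team sizes with a few lines. All figures come from this article. The sketch assumes the local cost stays flat as headcount grows, which is the article's claim. In practice electricity and API fallback spend would likely rise somewhat with usage.

```python
PRICE_PER_USER_MONTH = 30          # USD, ChatGPT Team price used in this article
HARDWARE = 1200                    # USD, one-time
RUNNING_BY_YEAR = [720, 360, 360]  # USD, electricity + API per year, from the cost table


def subscription_total(users: int, years: int = 3) -> int:
    """Total subscription spend for a team over the given number of years."""
    return users * PRICE_PER_USER_MONTH * 12 * years


def local_total() -> int:
    """Hardware plus running costs over the years in RUNNING_BY_YEAR."""
    return HARDWARE + sum(RUNNING_BY_YEAR)


def savings(users: int) -> int:
    """Subscription spend minus local spend over the same period."""
    return subscription_total(users, len(RUNNING_BY_YEAR)) - local_total()


for users in (10, 20):
    print(f"{users} users: subscription ${subscription_total(users):,}, "
          f"local ${local_total():,}, savings ${savings(users):,}")
```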
What This Is and Is Not
This is a documented case study of one team’s migration, with real numbers and real failure cases.
It is not a universal argument. A solo developer with occasional AI use and no technical infrastructure support has a different calculation. A team requiring the absolute best model quality on every task has a different calculation. A company with strict cloud-only IT policy has a different calculation.
The argument being made is specific: for teams using AI regularly across a predictable distribution of tasks, where the monthly bill has become noticeable, and where at least one person can manage a Linux server — the open-source model ecosystem in 2026 is good enough that the math has changed.
The quality gap between local models and frontier models has closed on the 80% of tasks that constitute most business AI usage. The remaining 20% is addressable with selective API fallback at costs that are a fraction of blanket subscription pricing.
Running the numbers honestly, with the real task distribution your team has — that is the calculation worth doing.
Note: Article content contains the views of the contributing authors and not Towards AI.