GEPA: How to Let an LLM Rewrite Its Own Prompts (and When It Actually Helps)

Last Updated on June 22, 2026 by Editorial Team

Author(s): Samarth Banodia

Originally published on Towards AI.

Manual prompt engineering is a loop you know too well: write a prompt, run it on a few examples, eyeball the failures, tweak some wording, repeat. It’s slow, it doesn’t scale across a dozen agent prompts, and the improvements are guesses. GEPA’s pitch is to hand that entire loop to an LLM — and the surprising part is how few runs it needs to beat reinforcement learning at it.

This post is a practitioner’s tour of GEPA — what the algorithm actually does, the one idea that makes it work, what the numbers look like, and (the part most write-ups skip) the cases where it won’t help you. If you build LLM systems and you’ve heard the name without quite getting how it works, this is for you.

The one-sentence version

GEPA (Genetic-Pareto) is a prompt optimizer: you give it a system with one or more prompts and a way to score outputs, and it evolves those prompts for you by having an LLM read the execution traces, reason in plain language about what went wrong, and write better prompts. It ships as dspy.GEPA in DSPy and as a standalone gepa package.

That’s the whole idea. The interesting question is why reading traces in natural language beats the standard RL approach — so let’s start there.

The problem GEPA is reacting against

The dominant way to adapt an LLM to a task with feedback is reinforcement learning — methods like GRPO. The mechanism is essentially brute force: sample thousands of trajectories, collapse each one into a single scalar reward, estimate a policy gradient from those numbers, and nudge the weights. It works, but it throws away almost everything interesting.

Think about what that scalar reward discards. Your agent ran, called three tools, produced a chain of reasoning, and got the answer wrong. RL records: reward = 0.0. It learns nothing about why — which tool call was malformed, which reasoning step went sideways, which instruction was ambiguous. All that diagnostic signal, sitting right there in the trace, gets compressed into one number.

For teams calling expensive APIs or working with small evaluation budgets, this is doubly painful: you pay for thousands of rollouts and you waste the richest part of each one.
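To make the loss concrete, here is a minimal sketch of the same failed run seen two ways: as the single number a reward-only method keeps, and as the readable diagnosis sitting in the trace. The `Trace` and `ToolCall` classes and both functions are hypothetical names invented for this illustration; they are not part of GEPA, DSPy, or any RL library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    error: Optional[str] = None  # set only when the call failed


@dataclass
class Trace:
    question: str
    tool_calls: List[ToolCall] = field(default_factory=list)
    final_answer: str = ""
    expected_answer: str = ""


def scalar_reward(trace: Trace) -> float:
    """What a reward-only method keeps: one number per rollout."""
    return 1.0 if trace.final_answer == trace.expected_answer else 0.0


def textual_feedback(trace: Trace) -> str:
    """What a reflective method can also read: where the run went wrong."""
    notes = []
    for i, call in enumerate(trace.tool_calls, start=1):
        if call.error:
            notes.append(f"tool call {i} ({call.name}) failed: {call.error}")
    if trace.final_answer != trace.expected_answer:
        notes.append(
            f"final answer {trace.final_answer!r} != expected {trace.expected_answer!r}"
        )
    return "; ".join(notes) or "no problems found"


# A made-up failed run: the second tool call was malformed.
trace = Trace(
    question="What year was the bridge completed?",
    tool_calls=[
        ToolCall("search", {"query": "bridge"}),
        ToolCall("lookup", {"page": ""}, error="missing required argument 'page'"),
    ],
    final_answer="1931",
    expected_answer="1937",
)

reward = scalar_reward(trace)
feedback = textual_feedback(trace)
print(reward)
print(feedback)
```

Both values come from the same rollout, so the feedback string costs nothing extra to produce. The only question is whether the optimizer is able to read it.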

The core idea: language is a richer learning signal than a number

GEPA’s bet is that the interpretable trace is a far better teacher than the gradient. Instead of asking “what’s the reward?”, it asks the question a human engineer would: what specifically went wrong in this run, and how should I change the prompt to fix it?

The loop looks like this:

Run and trace. Execute the current prompt on a minibatch and capture the full trajectory — inputs, reasoning, tool calls, tool outputs, errors, and the evaluation score.
Reflect. A reflection LLM reads those traces in natural language and diagnoses the failure modes (“the query was too broad and retrieved off-topic passages,” “it hallucinated a proof when the statement was false”).
Mutate. It proposes a targeted edit to the prompt that addresses that specific diagnosis — not a random perturbation, an informed one.
Select and repeat. The new candidate is scored and added to a pool, and GEPA picks what to evolve next.

This is the difference between random search and GEPA: the mutations are informed, because the reflection LLM that proposes them saw exactly what broke. One reflective update can sometimes produce a large jump, because it’s not stumbling toward an improvement; it’s reasoning its way to one.
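The four steps above can be sketched as a toy loop. This is an illustration of the control flow only, under heavy assumptions: `run_system` and `reflect_and_mutate` are hypothetical stand-ins for your LLM pipeline and for the reflection LLM, the "task" is a made-up one where each example passes only if the prompt contains a specific rule, and candidate selection is greedy for brevity rather than the Pareto-based sampling GEPA uses.

```python
# Toy data: each example passes only if the prompt contains the rule in "needs".
EXAMPLES = [
    {"input": "2 + 2", "needs": "show your arithmetic"},
    {"input": "capital of France", "needs": "answer in one word"},
    {"input": "is 7 prime", "needs": "answer yes or no"},
    {"input": "3 * 5", "needs": "show your arithmetic"},
]

MARKER = "prompt never says to "


def run_system(prompt, example):
    """Run and trace (stand-in): return a score plus a human-readable trace."""
    ok = example["needs"] in prompt
    trace = "ok" if ok else f"failed on {example['input']!r}: {MARKER}{example['needs']}"
    return {"score": 1.0 if ok else 0.0, "trace": trace}


def reflect_and_mutate(prompt, traces):
    """Reflect + mutate (stand-in for the reflection LLM): read the traces,
    diagnose the first failure, and propose a targeted edit to the prompt."""
    for trace in traces:
        if MARKER in trace:
            missing_rule = trace.split(MARKER, 1)[1]
            return f"{prompt} Always {missing_rule}."
    return prompt  # nothing failed, so nothing to change


def mean_score(prompt, batch):
    return sum(run_system(prompt, ex)["score"] for ex in batch) / len(batch)


def optimize(seed_prompt, examples, budget=6, minibatch_size=2):
    pool = [seed_prompt]
    for step in range(budget):
        # Select: greedy here; GEPA samples from a Pareto frontier instead.
        parent = max(pool, key=lambda p: mean_score(p, examples))
        # Deterministic rotating minibatch, to keep the sketch reproducible.
        start = (step * minibatch_size) % len(examples)
        batch = examples[start:start + minibatch_size]
        traces = [run_system(parent, ex)["trace"] for ex in batch]
        child = reflect_and_mutate(parent, traces)
        # Keep the child only if it improves on the minibatch it was written for.
        if mean_score(child, batch) > mean_score(parent, batch):
            pool.append(child)
    return max(pool, key=lambda p: mean_score(p, examples))


seed = "You are a helpful assistant."
best = optimize(seed, EXAMPLES)
print(mean_score(seed, EXAMPLES), "->", mean_score(best, EXAMPLES))
print(best)
```

Every accepted edit here traces back to a specific failure the proposer read, which is the property the real algorithm relies on. A random-search version would have to guess which sentence to add.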

There’s a lovely qualitative finding buried in the research here. As GEPA optimizes, prompts tend to evolve from telling the model what to do toward coaching it on how to do it — accumulating domain knowledge and guardrails the way a human expert would. In one math benchmark, evolved prompts started referencing specific strategies like Eisenstein’s Criterion for minimal polynomials, and added explicit protocols for handling false statements to stop the model hallucinating proofs. The optimizer is, in effect, writing down expertise it discovered through trial and error.

The “Pareto” half: how it avoids getting stuck

The “genetic” part is the mutate-and-select cycle. The “Pareto” part is the clever bit that keeps it from collapsing.

The naive move would be to always keep the single highest-scoring prompt and mutate that. The trap: you over-fit to whatever your best candidate happens to be good at, and you get stuck in a local optimum. GEPA instead maintains a Pareto frontier — a diverse set of candidates where different prompts win on different tasks or objectives (accuracy, conciseness, and so on). It then stochastically samples which candidate to evolve, weighted toward the ones that lead on the most tasks.
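Here is a minimal sketch of that selection rule, assuming you already have a per-task score for every candidate. The function names and the score table are invented for illustration. The sketch keeps only the "count the tasks you lead on, then sample in proportion" idea and leaves out refinements such as pruning dominated candidates.

```python
import random
from collections import Counter


def pareto_win_counts(scores):
    """For each candidate, count the tasks on which it ties for the best score.
    Candidates that are never best on any task do not appear in the result."""
    n_tasks = len(next(iter(scores.values())))
    wins = Counter()
    for t in range(n_tasks):
        best = max(task_scores[t] for task_scores in scores.values())
        for name, task_scores in scores.items():
            if task_scores[t] == best:
                wins[name] += 1
    return wins


def sample_parent(scores, rng):
    """Pick the next candidate to evolve, weighted by how many tasks it leads on."""
    wins = pareto_win_counts(scores)
    names = list(wins)
    return rng.choices(names, weights=[wins[n] for n in names], k=1)[0]


# Hypothetical per-task scores for four candidate prompts on four tasks.
scores = {
    "prompt_a": [1.0, 1.0, 0.0, 0.0],  # strong on tasks 0 and 1
    "prompt_b": [0.0, 0.0, 1.0, 0.5],  # the only one that solves task 2
    "prompt_c": [0.5, 0.5, 0.0, 1.0],  # best on task 3
    "prompt_d": [0.5, 0.5, 0.0, 0.0],  # never best anywhere
}

wins = pareto_win_counts(scores)
parent = sample_parent(scores, random.Random(0))
print(dict(wins))
print(parent)
```

Note what happens to `prompt_b`: its average score is the second lowest of the four, so a keep-the-best rule would drop it, yet it is the only candidate that knows how to solve task 2. Frontier-based sampling keeps it in play.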

The payoff is that GEPA holds onto multiple “winning strategies” at once and can even do a system-aware merge — combining the strengths of two candidates that excel on different slices of the problem. That diversity is what lets it explore the discrete, awkward space of natural-language prompts without burning a huge budget or tunneling into a dead end.
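One plausible way to picture a merge for a multi-module system: if two candidates descend from a common ancestor and each improved a different module's prompt, take each module from whichever descendant changed it. The sketch below is a simplified reading under that assumption, not the library's implementation. The function name, the module names, and the tie-break rule are all hypothetical.

```python
def merge_candidates(ancestor, parent_a, parent_b):
    """Combine two descendants of `ancestor`, module by module.

    Each candidate is a dict mapping module name -> prompt text.
    """
    merged = {}
    for module, base_prompt in ancestor.items():
        a, b = parent_a[module], parent_b[module]
        if a == base_prompt and b != base_prompt:
            merged[module] = b  # only B evolved this module
        elif b == base_prompt and a != base_prompt:
            merged[module] = a  # only A evolved this module
        else:
            # Both changed it, or neither did. Falling back to A is an
            # arbitrary tie-break chosen for this sketch.
            merged[module] = a
    return merged


ancestor = {
    "query_writer": "Write a search query.",
    "answerer": "Answer the question.",
}
parent_a = {
    "query_writer": "Write a narrow, entity-specific search query.",
    "answerer": "Answer the question.",
}
parent_b = {
    "query_writer": "Write a search query.",
    "answerer": "Answer in one short phrase, citing the retrieved passage.",
}

merged = merge_candidates(ancestor, parent_a, parent_b)
for module, prompt in merged.items():
    print(f"{module}: {prompt}")
```

The merged candidate still has to earn its place: it gets scored like any other proposal, and it is kept only if the combination actually helps.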

What the numbers actually say

The headline results from the paper are genuinely strong:

GEPA reportedly beats GRPO by ~10% on average and up to ~20%, while using up to 35× fewer rollouts.
On HotpotQA, one writeup reports GEPA lifting a 42% baseline to 62% accuracy in ~6,400 rollouts, where GRPO needed ~24,000 rollouts to reach 43%.
It also outperforms the prior leading prompt optimizer, MIPROv2, by over 10%.
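It helps to do the arithmetic on the HotpotQA figures quoted above, because "up to 35×" is a best case across benchmarks, not what this particular comparison shows. The snippet uses only the numbers from this section; the variable names are mine.

```python
# Figures as quoted above (accuracy in percent, rollouts as counts).
baseline_acc = 42
gepa_acc, gepa_rollouts = 62, 6_400
grpo_acc, grpo_rollouts = 43, 24_000

gepa_gain = gepa_acc - baseline_acc           # percentage points gained by GEPA
grpo_gain = grpo_acc - baseline_acc           # percentage points gained by GRPO
rollout_ratio = grpo_rollouts / gepa_rollouts  # how many times more rollouts GRPO used

# Gain per 1,000 rollouts, a rough sample-efficiency figure.
gepa_efficiency = gepa_gain / (gepa_rollouts / 1000)
grpo_efficiency = grpo_gain / (grpo_rollouts / 1000)

print(f"GEPA: +{gepa_gain} points, GRPO: +{grpo_gain} point(s)")
print(f"GRPO used {rollout_ratio:.2f}x the rollouts")
print(f"Points per 1k rollouts: GEPA {gepa_efficiency:.3f}, GRPO {grpo_efficiency:.3f}")
```

On this benchmark the rollout saving is 3.75×, not 35×. What stands out is the gap in accuracy gained for that budget: 20 points against 1.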

And the practical knock-on effect is the one teams actually care about: a well-optimized prompt on a small model can match or beat a larger frontier model. One Databricks-based demo reports optimizing a 20B model to reach the performance tier of much larger models — which, if it holds for your task, translates into dramatically cheaper inference. The GitHub project leans into this generality with a slogan worth remembering: if you can measure it, you can optimize it — prompts, code, agent configs, even scheduling policies.

When GEPA is the right tool — and when it isn’t

Here’s the part the hype tends to skip. GEPA is not a free win, and being honest about the boundaries is what separates a useful tool from a magic spell.

Reach for GEPA when:

Rollouts are expensive: slow agents, tool calls, scientific simulations. GEPA is often quoted as needing roughly 100–500 evals where RL might need 10,000+, though the real budget depends on the task (the HotpotQA result above used ~6,400 rollouts).
Data is scarce. It can work with just a handful of examples; you don’t need a big training set.
You only have API access. No weights required — you can optimize GPT, Claude, or Gemini straight through their APIs.
You value interpretability. The optimization trace is human-readable, so you can see why each prompt changed instead of trusting an opaque gradient.

Be skeptical when:

Your task is already near the model’s ceiling, or prompt wording isn’t the bottleneck. At least one independent study (on Verilog code generation) found GEPA-optimized prompts performing within the noise of a good hand-written chain-of-thought prompt — sometimes producing short, oddly incomplete-looking prompts that didn’t improve pass rates. More optimization isn’t always more performance.
Your evaluation metric is weak. GEPA optimizes exactly what you measure. If your metric is noisy or doesn’t capture what you actually want, the optimizer will happily evolve prompts that game it. The quality of your eval harness is the ceiling on the quality of your results.
You need to actually change model behavior at the weight level. Prompt optimization can’t teach genuinely new capabilities; for that, fine-tuning or RL still has a role. (The paper’s authors note the two are complementary — optimize prompts first, then fine-tune for further gains.)
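The weak-metric failure mode is easy to reproduce on paper. Below is a small sketch with two hypothetical scoring functions, neither taken from any library: a lenient substring check of the kind that is tempting to write first, and a stricter exact match. The "hedging" output is the sort of answer an optimizer could drift toward if only the lenient metric is watching.

```python
def lenient_metric(prediction: str, gold: str) -> float:
    """Passes if the gold answer appears anywhere in the output."""
    return 1.0 if gold.lower() in prediction.lower() else 0.0


def strict_metric(prediction: str, gold: str) -> float:
    """Passes only if the output is the gold answer (ignoring case and padding)."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


gold = "Paris"
honest = "Paris"
hedging = "It could be Paris, London, Berlin, or Madrid."

results = {
    "lenient/honest": lenient_metric(honest, gold),
    "lenient/hedging": lenient_metric(hedging, gold),
    "strict/honest": strict_metric(honest, gold),
    "strict/hedging": strict_metric(hedging, gold),
}
for name, score in results.items():
    print(f"{name}: {score}")
```

Under the lenient metric, a prompt that says "list every plausible answer" scores as well as one that gets the answer right, so the optimizer has no reason to prefer the second. Exact match is not always the right fix, but the metric has to be able to tell those two outputs apart.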

The takeaway

GEPA reframes prompt optimization from an art into something closer to a measurable, automatable engineering process — and it does it by trusting language over scalars. The real insight isn’t “evolution beats RL”; it’s that the execution trace is a goldmine of diagnostic signal that traditional methods crush into a single number, and an LLM is now good enough to mine it.

If you’re running expensive agents, working with little data, or trying to get small-model economics out of a frontier-model task, it’s worth an afternoon. Just bring a serious evaluation metric — because GEPA will optimize precisely what you ask it to, and nothing more.

GEPA’s paper is on arXiv (2507.19457), the implementation lives at gepa-ai/gepa and as dspy.GEPA in DSPy, and there are solid hands-on walkthroughs from Pydantic and others. If you've run it in production, I'm most curious about the part that's hardest to write down: how much of your win came from GEPA versus from finally building a good eval set.
