What Nobody Tells You About Putting LLMs Inside Your Data Pipeline

Last Updated on June 3, 2026 by Editorial Team

Author(s): Sunil kumar Reddy

Originally published on Towards AI.

A practitioner’s honest account — written from financial data engineering — of what breaks, what surprises you, and what six months of production will teach you that no tutorial ever will.

When I first started wiring LLMs into our data pipelines, I spent three weeks debating which model to use. Spreadsheets, benchmarks, Slack threads at midnight. The whole thing. Everyone had a strong opinion. It felt like the most important decision in the room.

Six months of production later, I’d trade every hour of that debate for one honest article about what actually breaks. Because the model turned out to be maybe 20% of the problem. The other 80% is everything the tutorials don’t cover — data quality failures that become confident hallucinations, RAG pipelines that silently bypass your access controls, token costs that quietly triple before anyone notices, and a compliance team asking you to trace a model output back to its source data while you stare at your screen realising you have no way to do that.

So. Let’s talk about what happens after the first demo works.

The gap between tutorial and production

Every getting-started guide shows the same picture. Some data flows in. The LLM processes it. A clean answer comes out. Beautiful. Ship it.

Real production is something else entirely. The full architecture — the one that actually survives contact with real data, real users, and real compliance requirements — looks more like this:

[Figure: the full production architecture, with the model as one node among six layers]

Count the layers. There are six of them — and the model itself is just one node somewhere in the middle. Everything around it is what keeps the thing honest, safe, and debuggable in production. Let’s go through each problem, one at a time.

Problem 1: Your data quality issue becomes the model’s reasoning problem

This one took me longer to fully absorb than it should have. Classical data pipelines fail loudly. A dbt model with a null join key throws an error. A Great Expectations suite blocks a table from promoting to Silver. You know something is wrong. You fix it.

LLMs don’t work that way. Feed a language model corrupted or incomplete data and it doesn’t throw an exception. It reasons confidently over the garbage. It generates a plausible-sounding answer that is factually wrong in a way that’s completely invisible unless someone downstream already knows the correct answer. That’s a profoundly different failure mode — and it’s why data quality gates need to exist before and after the LLM call, not just at ingestion.

The specific checks I’d recommend:

Gate 1, at ingestion: schema validation, completeness thresholds, type assertions, and a PII scan. If any of these fail, route to a dead-letter queue. Do not pass go. Tools like Great Expectations or Soda Core handle this well, and if you’ve already built a Silver-layer quality suite in your lakehouse, most of those checks transfer directly.
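
To make Gate 1 concrete, here is a minimal sketch in plain Python rather than Great Expectations or Soda Core. The schema, the 0.95 completeness threshold, the email regex standing in for a PII scanner, and every name in it (`REQUIRED_FIELDS`, `ingestion_gate`, `GateResult`) are assumptions for illustration, not something from a real pipeline.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical schema and thresholds, for illustration only.
REQUIRED_FIELDS = {"txn_id": str, "amount": float, "currency": str}
MIN_COMPLETENESS = 0.95  # minimum share of records with no missing required field
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # crude stand-in for a PII scanner


@dataclass
class GateResult:
    passed: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)  # (record, reason) pairs
    batch_ok: bool = True  # False when completeness falls below the threshold


def check_record(record: dict) -> Optional[str]:
    """Return a failure reason, or None if the record is clean."""
    for name, expected_type in REQUIRED_FIELDS.items():
        value = record.get(name)
        if value is None:
            return f"missing:{name}"
        if not isinstance(value, expected_type):
            return f"type:{name}"
    for value in record.values():
        if isinstance(value, str) and EMAIL_RE.search(value):
            return "pii:email"
    return None


def ingestion_gate(records: list) -> GateResult:
    """Split a batch into clean records and dead-lettered records."""
    result = GateResult()
    missing = 0
    for record in records:
        reason = check_record(record)
        if reason is None:
            result.passed.append(record)
        else:
            result.dead_letter.append((record, reason))
            if reason.startswith("missing:"):
                missing += 1
    completeness = 1.0 if not records else 1 - missing / len(records)
    result.batch_ok = completeness >= MIN_COMPLETENESS
    return result
```

Anything in `dead_letter` would go to the dead-letter queue with its reason attached, and a batch with `batch_ok` set to False would not be promoted at all.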

Gate 2, right before the LLM call: this is the one people miss. After you’ve assembled your context window, check the token count against your model’s actual limit, score the retrieved chunks for relevance before including them, and run one more PII scan on the assembled prompt. I have personally seen pipelines that passed Gate 1 cleanly, assembled a 28,000-token context full of irrelevant retrieved documents, sent it to GPT-4o, and then spent three weeks wondering why the outputs had gotten so strange.
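
A rough sketch of Gate 2, under stated assumptions: the token estimate is a four-characters-per-token heuristic (a real pipeline would use the model's own tokenizer), the relevance scores are assumed to arrive with the chunks, and the eight-digit regex is only a placeholder for a proper PII detector. The budget, the threshold, and the function names are hypothetical.

```python
import re

MAX_CONTEXT_TOKENS = 3000  # hypothetical budget, set well below the model's hard limit
MIN_RELEVANCE = 0.75       # hypothetical relevance cut-off
ACCOUNT_RE = re.compile(r"\b\d{8}\b")  # crude stand-in for a real PII detector


def estimate_tokens(text):
    # Rough heuristic (about 4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)


def assemble_context(question, chunks):
    """chunks is a list of (text, relevance_score) pairs.

    Returns (prompt, estimated_tokens) or raises ValueError if the
    assembled prompt still contains something that looks like PII.
    """
    kept = []
    used = estimate_tokens(question)
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        if score < MIN_RELEVANCE:
            break  # sorted descending, so every remaining chunk is weaker
        cost = estimate_tokens(text)
        if used + cost > MAX_CONTEXT_TOKENS:
            continue  # skip any chunk that would blow the budget
        kept.append(text)
        used += cost
    prompt = "\n\n".join(kept + [f"Question: {question}"])
    if ACCOUNT_RE.search(prompt):
        raise ValueError("PII detected in assembled prompt")
    return prompt, used
```

The point of the design is that the gate runs on the assembled prompt, not on the source tables, so it catches problems that only appear once retrieval has done its work.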

Problem 2: RAG pipelines secretly bypass your access controls

This is the governance problem that doesn’t get discussed nearly enough. And in a regulated environment like banking, it’s not academic. It’s potentially a breach.

Here’s the scenario. You’ve built a RAG system that retrieves chunks from your internal knowledge base to give the LLM context. You’ve got row-level security on your lakehouse tables. You’ve got Unity Catalog policies on your Gold layer. All of it locked down correctly.

And then a junior analyst types a question into your LLM-powered search tool that happens to trigger a retrieval of chunks containing customer account information. Chunks that were embedded from a document the analyst has no permission to view. The LLM helpfully surfaces the answer.

Row-level security doesn’t protect you here. The vector search doesn’t check permissions. Your retrieved chunks carry no access metadata by default. The entire governance stack you built for your structured data layer is invisible to the RAG retrieval path.

The fix is not simple, but it is necessary: your embedding pipeline needs to store access metadata alongside every vector. When a query comes in, filter vector search results by the permissions of the requesting user before passing anything to the LLM. Tools like Weaviate and Qdrant support metadata filtering natively. The permissioned retrieval path works as follows:

At embedding time, every document chunk gets tagged with the access roles permitted to see it. At query time, the vector search is filtered by the requesting user’s roles before anything reaches the context window. If nothing clears the filter, the pipeline returns a graceful “I don’t have information available for that query” rather than silently returning nothing or, worse, bypassing the check entirely. It takes more engineering to set up. It’s not optional.
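
The path described above can be sketched with an in-memory index. This does not use the Weaviate or Qdrant APIs; it is a toy illustration of the same idea, and the `Chunk` type, the two-dimensional vectors, and the `min_score` cut-off are all assumptions. In a real vector database the role filter would be pushed into the search query itself.

```python
import math
from dataclasses import dataclass

NO_ANSWER = "I don't have information available for that query."


@dataclass(frozen=True)
class Chunk:
    text: str
    vector: tuple
    allowed_roles: frozenset  # written at embedding time


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, user_roles, index, top_k=3, min_score=0.5):
    # Permission filter runs first: chunks the user cannot see are never scored.
    visible = [c for c in index if c.allowed_roles & set(user_roles)]
    scored = sorted(
        ((cosine(query_vector, c.vector), c) for c in visible),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [c for score, c in scored[:top_k] if score >= min_score]


def build_context(query_vector, user_roles, index):
    chunks = retrieve(query_vector, user_roles, index)
    if not chunks:
        return NO_ANSWER  # explicit refusal instead of an empty context
    return "\n".join(c.text for c in chunks)
```

Filtering before scoring, rather than after, means a restricted chunk can never influence ranking or leak through a `top_k` slot.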

Problem 3: Text-to-SQL is brilliant in demos and brittle in production

The demos are genuinely impressive. You type “what was our revenue by product category last quarter” and out comes a perfect, well-formatted SQL query. Everyone in the room wants to ship it immediately.

Then it hits production. Your schema has 340 tables. Twelve of them have a column called status. Three different tables have a customer_id that means three different things depending on which product line created the row. The column naming conventions evolved over seven years and two acquisitions. And the LLM, with no knowledge of any of this context, happily generates SQL that joins the wrong dimension and returns a number that is completely plausible but off by about 40%.

Nobody catches it for eleven days.

The core problem is that SQL generation quality degrades sharply with schema complexity and naming ambiguity. The fix requires three things: a curated schema layer exposed to the LLM (not the full schema, just the relevant tables with good descriptions), a feedback loop that stores validated query-result pairs as few-shot examples, and a post-generation SQL validation step that runs an EXPLAIN on every generated query before execution.

Tools like Vanna.AI implement the feedback loop pattern reasonably well. DSPy from Stanford takes a more principled approach, optimising the prompting chain using a small labelled dataset. Neither is magic. Both require ongoing maintenance as your schema evolves.
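
The post-generation validation step is the easiest of the three to sketch. The example below uses SQLite from the standard library, where `EXPLAIN QUERY PLAN` compiles a statement without running it; a warehouse engine would use its own `EXPLAIN`. The `validate_sql` helper and its deliberately strict rules (plain `SELECT` only, no embedded semicolons) are assumptions for illustration. Note what it does not catch: a query that joins the wrong dimension will still plan cleanly.

```python
import sqlite3


def validate_sql(conn, sql):
    """Return (ok, detail) without ever executing the generated query."""
    stripped = sql.strip().rstrip(";")
    if not stripped.lower().startswith("select"):
        return False, "only SELECT statements are allowed"
    if ";" in stripped:
        # Conservative: also rejects a literal containing ';'.
        return False, "multiple statements are not allowed"
    try:
        # The planner resolves tables and columns, so unknown names fail here.
        plan = conn.execute(f"EXPLAIN QUERY PLAN {stripped}").fetchall()
    except sqlite3.Error as exc:
        return False, f"planner rejected query: {exc}"
    return True, plan


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, category TEXT, revenue REAL)")

ok, detail = validate_sql(conn, "SELECT category FROM order_items")
print(ok, detail)
```

A rejected query can be fed back to the model with the planner's error message, or simply refused; either way it never reaches the warehouse.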

Problem 4: Token costs will surprise you and then they’ll keep surprising you

I’ve seen this pattern enough times to call it a rule. The prototype uses maybe £200 of API calls a month. You go to production. Within six weeks, someone’s built a dashboard that queries the LLM for every row load, someone else is embedding 50,000 documents nightly, and the monthly bill is £14,000. Nobody budgeted for that. Nobody even tracked it until the finance team emailed.

Token cost in a data pipeline isn’t a one-time concern. It’s an ongoing engineering discipline. A few things that actually move the needle:

Caching is underused. Identical or near-identical prompts get sent to the LLM dozens of times per hour in most pipeline patterns. Semantic caching, where you store recent prompt-response pairs and retrieve them by embedding similarity before making an API call, can cut costs by 30–60% in high-repetition workloads. GPTCache and LangChain's caching layer both handle this.
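
A toy version of semantic caching, to show the shape of the idea. It does not use GPTCache or LangChain. The `toy_embed` function is a bag-of-words stand-in for a real embedding model, the lookup is a linear scan, and the threshold is arbitrary; all of these are assumptions, and a production cache would also need expiry and a proper vector index.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    # Stand-in for a real embedding model: bag-of-words term counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.9, embed=toy_embed):
        self.threshold = threshold
        self.embed = embed
        self.entries = []  # (embedding, response) pairs
        self.hits = 0
        self.misses = 0

    def get_or_call(self, prompt, call_llm):
        vector = self.embed(prompt)
        best_score, best_response = 0.0, None
        for cached_vector, cached_response in self.entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, cached_response
        if best_response is not None and best_score >= self.threshold:
            self.hits += 1
            return best_response  # served from cache, no API call made
        self.misses += 1
        response = call_llm(prompt)
        self.entries.append((vector, response))
        return response
```

Tracking `hits` and `misses` matters as much as the cache itself: the hit rate is what tells you whether the similarity threshold is saving money or serving stale answers.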

Context window hygiene matters more than people realise. Sending 15,000 tokens of context when 3,000 would have answered the question is expensive and it usually produces worse outputs, not better ones. Relevance scoring your retrieved chunks before including them is not a nice-to-have. It’s a cost control measure.

Model routing is worth considering earlier than you’d think. Not every query needs GPT-4o. A well-prompted gpt-4o-mini handles a large proportion of classification and extraction tasks at roughly one-twentieth of the cost. Routing by query complexity, using a cheap classifier to decide which model to invoke, is a legitimate production pattern.
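
As a sketch of the routing idea, here is the simplest possible router. It uses a hand-written rule in place of a trained classifier, and the task labels and the length cut-off are assumptions for illustration; only the two model names come from the discussion above.

```python
CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4o"

# Hypothetical routing rules, for illustration only.
SIMPLE_TASKS = {"classification", "extraction"}
MAX_CHEAP_CHARS = 4000  # long prompts go to the stronger model


def route(task_type, prompt):
    """Pick a model name from the task type and prompt size."""
    if task_type in SIMPLE_TASKS and len(prompt) <= MAX_CHEAP_CHARS:
        return CHEAP_MODEL
    return STRONG_MODEL


print(route("classification", "Is this ticket about billing or fraud?"))
print(route("analysis", "Explain the variance in Q3 revenue."))
```

Even a rule this crude is worth logging against, because the routing decision becomes one more field you can audit when an output looks wrong.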

Problem 5: Hallucination is not a model problem — it’s a systems problem

There’s a tempting framing where hallucination is the model vendor’s problem to solve. A better model hallucinates less, so you upgrade, and eventually it goes away.

That’s not quite right. Hallucination is partly a function of model quality, yes. But it’s also a function of how much uncertain territory you’re asking the model to navigate, how you’ve structured your prompt, how relevant your retrieved context is, and how you’re validating outputs before they reach users. All of those are engineering decisions, not model parameters.

The practical guardrail stack I’d recommend for a production data pipeline:

Input-side: a system prompt with explicit boundaries on what the model is allowed to claim (“If you cannot answer from the provided context, say so. Do not speculate.”), plus a relevance threshold on your retrieved chunks so you don’t include weakly relevant documents that give the model something to confuse.

Output-side: a structured output parser that validates the response format before using it, a confidence scoring layer (asking the model to rate its own certainty, imperfect but useful as a filter), and for high-stakes outputs, a human review queue for responses below a confidence threshold.

None of this eliminates hallucination. It reduces it to a manageable rate and, crucially, it gives you visibility into when and where it’s happening.
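
The output side of that stack might look something like the sketch below. It assumes the model has been asked to reply with a JSON object containing a `label` and a self-reported `confidence`; the label set, the 0.7 threshold, and the list standing in for a review queue are all hypothetical.

```python
import json

CONFIDENCE_THRESHOLD = 0.7  # hypothetical cut-off for human review
ALLOWED_LABELS = {"billing", "fraud", "technical", "other"}  # hypothetical label set


def parse_output(raw):
    """Validate the model's JSON reply; raise ValueError on any format problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label = data.get("label")
    confidence = data.get("confidence")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    return {"label": label, "confidence": float(confidence)}


def guard(raw, review_queue):
    """Return the parsed output, or None after queueing it for human review."""
    try:
        parsed = parse_output(raw)
    except ValueError as exc:
        review_queue.append({"raw": raw, "reason": str(exc)})
        return None
    if parsed["confidence"] < CONFIDENCE_THRESHOLD:
        review_queue.append({"raw": raw, "reason": "low confidence"})
        return None
    return parsed
```

Every rejection carries a reason, which is what turns the review queue into a source of visibility rather than just a holding pen.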

Problem 6: Lineage for LLM outputs is harder than lineage for SQL — and just as important

SQL has decades of tooling for lineage. You can trace a column in a BI dashboard back through every dbt transformation to its source table in Bronze. That chain is explicit, deterministic, and auditable.

LLM outputs are none of those things. When your compliance team asks “what data produced this customer risk score generated by the LLM?”, the answer involves: which documents were retrieved by the RAG system, which version of the prompt template was in use, which model version was called, which features from the feature store were included in the context, and what the model’s training cutoff date was relative to the event being scored.

That’s a lot of things to log. And most teams aren’t logging them at all, because the pipeline works fine until the day it doesn’t, and then you desperately need the audit trail you never built.

The minimum viable LLM audit log needs to capture: the request ID, the user or system that triggered the call, the model and version, the full rendered prompt (or a hash of it, if size is a concern), the retrieved chunk IDs and their source document references, the raw model response, the parsed output, the confidence score if applicable, and the timestamp. That’s a wide row. Store it somewhere queryable — a Delta table on your lakehouse is fine. Index on request ID and timestamp. Keep it for as long as your data retention policy requires, which in financial services is typically seven years.
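
One way to pin that row down is a dataclass whose fields mirror the list above, plus the prompt template version. The class and field names are assumptions, and writing the resulting dictionary to a Delta table is left out; this only shows the shape of the record.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LLMAuditRecord:
    triggered_by: str             # user or system that made the call
    model: str
    model_version: str
    prompt_template_version: str
    rendered_prompt: str
    retrieved_chunk_ids: list
    source_document_refs: list
    raw_response: str
    parsed_output: dict
    confidence: Optional[float] = None
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_row(self, hash_prompt=False):
        """Flatten to a dict; optionally keep only a hash of the prompt."""
        row = asdict(self)
        row["prompt_sha256"] = hashlib.sha256(
            self.rendered_prompt.encode("utf-8")
        ).hexdigest()
        if hash_prompt:
            del row["rendered_prompt"]
        return row


record = LLMAuditRecord(
    triggered_by="ticket-classifier",
    model="gpt-4o",
    model_version="example-version",  # placeholder, not a real version string
    prompt_template_version="v3",
    rendered_prompt="Classify: card declined abroad",
    retrieved_chunk_ids=["chunk-1"],
    source_document_refs=["doc-9"],
    raw_response='{"label": "billing", "confidence": 0.91}',
    parsed_output={"label": "billing"},
    confidence=0.91,
)
print(json.dumps(record.to_row(hash_prompt=True), indent=2))
```

Keeping only the hash saves space but gives up the ability to reconstruct exactly what the model saw, so that trade-off is worth deciding deliberately.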

This is the architecture that ties everything together from a lineage perspective:

Three things the audit log enables that you can’t do without it: full compliance reconstruction (“show me exactly what the model saw when it made this decision”), model debugging (“why did this specific query produce a bad output last Tuesday”), and drift detection (“is the distribution of model outputs shifting over time in ways that suggest the underlying data has changed”).

The third one is particularly useful and underappreciated. If your average output confidence score drops from 0.87 to 0.71 over three weeks without any model change, that’s a signal your data has drifted in a way that’s making the model’s job harder. You want to know that before users start complaining.
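
A deliberately simple sketch of that check, assuming you already have one average confidence score per day pulled from the audit log. The window size and the alert threshold are hypothetical, and comparing two window means is only one of many ways to detect drift.

```python
from statistics import mean


def confidence_drift(daily_scores, window=7, max_drop=0.10):
    """Compare the mean of the earliest window with the mean of the latest one.

    Returns None when there is not enough history to compare.
    """
    if len(daily_scores) < 2 * window:
        return None
    baseline = mean(daily_scores[:window])
    recent = mean(daily_scores[-window:])
    drop = baseline - recent
    return {
        "baseline": baseline,
        "recent": recent,
        "drop": drop,
        "alert": drop > max_drop,
    }


# Three weeks of daily averages sliding from 0.87 to 0.71.
history = [0.87] * 7 + [0.80] * 7 + [0.71] * 7
print(confidence_drift(history))
```

With the numbers from the example above, the drop is about 0.16, which clears the hypothetical 0.10 threshold and raises the alert.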

What a real production setup looks like at modest scale

Let me make this concrete before I close. Say you’re a team of six data engineers at a financial services firm. Not a big tech company with 400 ML engineers. A real team with a real deadline.

Your pipeline probably looks something like this: Kafka for event ingestion, Delta Lake on Azure for your lakehouse, dbt for transformations, Airflow for orchestration. You want to add LLM capabilities because there is a genuine business need — say, automated classification of customer support tickets and a natural-language query layer on top of your reporting data.

The architecture I’d build: one RAG pipeline for the knowledge base queries (documents embedded into Azure AI Search with role-metadata filtering, retrieved chunks scored by BM25 plus vector similarity), one Text-to-SQL pipeline for structured reporting queries (schema metadata stored in a separate context table, Vanna for the few-shot feedback loop, EXPLAIN validation before execution), both feeding into a shared LLM call layer with retry logic, a 30-second timeout, and a fallback to a faster, cheaper model if the primary is slow.
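
The shared call layer is worth sketching, because the retry and fallback rules are easy to get subtly wrong. The version below is one possible reading of that design, not a definitive one: errors from the primary are retried with exponential backoff, a timeout skips straight to the fallback, and exhausted retries also end at the fallback. `primary` and `fallback` are hypothetical callables wrapping whichever client you use.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, prompt, timeout_s):
    # The worker thread keeps running after a timeout; we only stop waiting for it.
    future = _POOL.submit(fn, prompt)
    return future.result(timeout=timeout_s)


def call_llm(prompt, primary, fallback, timeout_s=30.0, retries=2, backoff_s=1.0):
    """Return (response, which_model_answered)."""
    for attempt in range(retries + 1):
        try:
            return call_with_timeout(primary, prompt, timeout_s), "primary"
        except FutureTimeout:
            break  # primary is slow: go straight to the fallback
        except Exception:
            if attempt < retries:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return call_with_timeout(fallback, prompt, timeout_s), "fallback"
```

Returning which model answered alongside the response lets the audit table record it, so a run of fallback answers shows up in the dashboard rather than hiding inside the call layer.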

The whole thing writes to a Delta audit table on every call. Great Expectations checks run nightly on the output distribution. A Grafana dashboard shows token cost per pipeline, average confidence by query type, and daily call volume by user team.

Is it perfect? No. You will hit schema drift you didn’t anticipate. The RAG retrieval will occasionally surface irrelevant documents that somehow passed the relevance filter. The Text-to-SQL will generate a join that works but returns subtly wrong numbers on one specific edge case that nobody notices for a week.

But it’s recoverable. Because you have the audit log to trace the problem, the observability to detect the drift, and the governance layer to make sure the LLM never sees data it shouldn’t. That’s the real goal. Not a perfect pipeline. A pipeline that fails detectably and can be fixed cheaply.

Final thought

The pattern I’ve seen across every organisation that does this well is simple. They treated the LLM as one component in a data system, not as the data system itself. They applied the same engineering discipline to the LLM layer that they applied to their dbt models and their Kafka consumers. They built observability before they needed it. They thought about governance before they had a compliance conversation that forced them to.

The model choice matters, eventually. But it’s probably the fifth most important decision you’ll make. The first four are data quality, access control, lineage, and observability. Get those right and the model almost doesn’t matter. Get them wrong and no model is good enough.

Note: Article content contains the views of the contributing authors and not Towards AI.