Benchmarking RAG Architectures Locally on a Real Financial PDF — Part 1: The Text Layer
Last Updated on June 25, 2026 by Editorial Team
Author(s): Ali Enver Arslan
Originally published on Towards AI.
Part 1 of a three-part series.
Some of the most useful documents in a bank are also the hardest for a machine to read: dense, chart-heavy PDFs where the numbers that matter are drawn inside the graphics rather than written as text. And in regulated banking you usually cannot ship those documents off to a third-party model API to make sense of them — the data has to stay in-house. That is the practical question behind this project: how well can a retrieval-augmented generation (RAG) [1] system read a document like this when it has to run entirely on local, open models, with no external endpoints at all?
A word on what this is and is not. It is a personal project, and the document is a public earnings deck from Akbank, the bank in Turkey where I work — exactly the kind of file we read every day. It is not a polished, production-grade system, and the scores are not meant to look impressive in absolute terms; they land around the middle of what these methods can do. That is rather the point. The aim is to map what is genuinely achievable offline — on a single workstation, with today’s open models — and to see which architectural choices actually move the needle. Better hardware and stronger local models would lift every number here; what carries over is the comparison between methods, not the absolute figures.
So the one firm rule was that everything runs locally: no external model endpoints anywhere in the system — just my own machine, an Ollama server, and the question of which architecture reads the document best. That is not an artificial constraint. It is the real environment many regulated enterprises work in, and rarely the one RAG gets benchmarked in.
What this series covers
Over three articles, I compare six RAG architectures on a single financial document, all measured with the same metrics and graded by the same local model:
Part 1 (this article) — the document, how it is parsed, the evaluation setup, and the first family of methods: the text-retrieval pipelines (Naive RAG, Hybrid RAG, and Hybrid RAG with a ColBERT reranker).
Part 2 — the structured and visual methods: Graph RAG, ColPali visual retrieval, and a vision pipeline that answers from the page images directly.
Part 3 — the full comparison across all six methods, broken down by question type (text, table, chart), and two findings: which kind of content only one of these approaches can recover, and where the standard metrics quietly mislead you.
The goal is not to crown a single winner and stop. It is to understand why each architecture behaves the way it does on a hard, real document, using a setup that someone under the same local-only constraint could reproduce.
Meet the document
The corpus is Akbank’s 4Q2025 earnings presentation — a public, 38-page deck. It is exactly the kind of document that makes RAG hard: not pages of plain text, but slides packed with waterfall charts, dense financial tables, and footnotes, where the actual numbers live inside the graphics.

For a RAG system, the first problem is not retrieval or generation. It is getting these numbers out of the page at all.
From PDF to Markdown
This deck is effectively image-only: the pages are rendered graphics, not a selectable text layer. You can see the consequence by asking a plain PDF parser to read a page. Here is `pymupdf` on the balance-sheet page:
import fitz  # PyMuPDF
page = fitz.open("akbank_4q2025.pdf")[33]  # the balance-sheet page
print(page.get_text("text"))
Output:
Consolidated (TL mn)Cash and due from BanksSecuritiesTL FX (USD)
Loans (net)TLFX (USD)OtherTotal AssetsDepositsTLFX (USD)Funds Borrowed an…
It returns the row labels and not a single number. Total Assets of 3,558,950? Gone. Out of ~130 characters on that page, every one is label text; the entire numeric grid is invisible, because it is painted into the image.
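A quick way to catch this situation before it silently poisons an index is to measure how much of a page's extracted text is numeric. The sketch below is an illustrative heuristic, not part of the pipeline described here: the names `numeric_share` and `needs_ocr` and the 5% threshold are assumptions chosen for the example.

```python
def numeric_share(text: str) -> float:
    """Fraction of non-whitespace characters that are digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)


def needs_ocr(text: str, min_numeric_share: float = 0.05) -> bool:
    """Flag a page whose text layer carries (almost) no digits.

    The 0.05 threshold is an assumption for financial slides, where a
    readable page would normally be full of figures.
    """
    return numeric_share(text) < min_numeric_share


# Label-only text, as the plain parser returned it for the balance-sheet page.
labels_only = "Consolidated (TL mn)Cash and due from BanksSecuritiesTL FX (USD)"
# What a readable row would look like once the number is present.
with_number = "Total Assets 3,558,950"

print(needs_ocr(labels_only))  # True: the "(" and letters are there, the digits are not
print(needs_ocr(with_number))  # False
```

On a deck like this one, a check of this kind would flag every page, which is consistent with forcing OCR across the whole document rather than deciding page by page.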

So extraction has to be OCR-based, not parse-based. The pipeline I settled on uses Docling [2] with EasyOCR, forcing full-page OCR and reconstructing tables with TableFormer [3] so a table comes out as a real Markdown grid rather than a stream of loose digits:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions
opts = PdfPipelineOptions()
opts.do_ocr = True
opts.do_table_structure = True # TableFormer
opts.ocr_options = EasyOcrOptions(force_full_page_ocr=True)
opts.images_scale = 3.0 # higher DPI = sharper digits
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
markdown = converter.convert("akbank_4q2025.pdf").document.export_to_markdown()

One honest caveat that matters later in the series: OCR recovers plain text and tables well, but charts are harder. The numbers on a waterfall chart are printed inside the drawing, so OCR catches them only as scattered, context-free labels and loses which value belongs to which bar. That gap between “tables, solved” and “charts, not really” is a theme I return to in Parts 2 and 3.
Every text pipeline below reads from the same Markdown produced here, so the only thing that varies between them is the retrieval architecture.
The evaluation setup
For evaluation I use a fixed set of 57 questions, each with a reference answer. The questions and reference answers were prepared independently of the systems being tested — read off the original document and validated by hand against the rendered pages — so the answer key cannot favor any one pipeline. Each question is tagged by where its answer lives: in plain text (20), in a table (19), or in a chart (18). A separate small set of deliberately unanswerable questions serves as a hallucination check.
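Because every question carries a tag for where its answer lives, any per-question score can later be averaged by tag. The sketch below shows one way to hold such a set and aggregate it; the `EvalItem` class, the `mean_by_type` helper, the placeholder questions, and the scores are all invented for illustration and are not the actual evaluation data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalItem:
    question: str
    reference: str
    source_type: str  # "text", "table", or "chart"


def mean_by_type(items, scores):
    """Average one per-question score within each source type."""
    buckets = defaultdict(list)
    for item, score in zip(items, scores):
        buckets[item.source_type].append(score)
    return {kind: sum(vals) / len(vals) for kind, vals in buckets.items()}


# Placeholder items and scores, purely to exercise the helper.
items = [
    EvalItem("placeholder text question", "placeholder answer", "text"),
    EvalItem("placeholder table question A", "placeholder answer", "table"),
    EvalItem("placeholder table question B", "placeholder answer", "table"),
    EvalItem("placeholder chart question", "placeholder answer", "chart"),
]
scores = [1.0, 0.0, 1.0, 0.5]

print(mean_by_type(items, scores))
```

Keeping the tag on the item rather than in a separate lookup means the same breakdown works for any metric and any pipeline without extra bookkeeping.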
Every pipeline is scored with RAGAS [4] on five metrics:
- Faithfulness: is the answer supported by the retrieved context?
- Answer relevancy: does the answer actually address the question?
- Context precision: of what was retrieved, how much was relevant?
- Context recall: of what was needed, how much was retrieved?
- Answer correctness: does the answer match the reference answer?
Answer correctness is the only one that compares against the independent reference answer, so it is the metric I trust most for “did it actually get the right answer.”
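To make the two context metrics concrete, here is a set-based toy analogue. It is only an intuition aid: RAGAS has an LLM judge decide relevance rather than matching chunk IDs, so the function name, the chunk IDs, and the exact arithmetic below are assumptions for illustration, not the library's computation.

```python
def context_precision_recall(retrieved, needed):
    """Toy set-based precision and recall over chunk IDs."""
    retrieved_set, needed_set = set(retrieved), set(needed)
    hits = retrieved_set & needed_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(needed_set) if needed_set else 0.0
    return precision, recall


# Hypothetical IDs: four chunks retrieved, two chunks actually needed.
retrieved = ["c1", "c2", "c3", "c4"]
needed = ["c2", "c5"]

precision, recall = context_precision_recall(retrieved, needed)
print(precision, recall)  # 0.25 0.5
```

Only one of the four retrieved chunks was needed (precision 0.25), and only one of the two needed chunks was retrieved (recall 0.5). Neither number says whether the final answer was right, which is why answer correctness is tracked separately.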
One caveat on faithfulness, since it usually carries real weight: a low score normally warns that the model is answering from its own prior knowledge instead of the retrieved text — a hallucination risk, even when the answer happens to be right. Here that risk is unusually small. The questions ask for exact, document-specific figures — a particular quarter’s capital ratio, one line on the balance sheet — that a model cannot plausibly invent, so getting them right essentially requires having read them off this document. That is why I lean on answer correctness and, as Part 3 will show, read faithfulness as a diagnostic signal rather than a verdict.
The judge is a local qwen3:14b (reasoning disabled, so it returns direct structured output), with nomic-embed-text for the metric embeddings:
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall, answer_correctness,
)
result = evaluate(
    dataset,  # question, answer, retrieved contexts, reference
    metrics=[faithfulness, answer_relevancy,
             context_precision, context_recall, answer_correctness],
    llm=judge,  # local qwen3:14b, reasoning off
    embeddings=judge_embeddings,  # nomic-embed-text
)
The rest of the local stack: Ollama for serving, gemma4 for answer generation, BAAI/bge-m3 [5] for dense embeddings, ChromaDB for the vector store. Everything ran on a single workstation under WSL2. With all of that fixed, the interesting question is what each architecture does with the same document, the same 57 questions, and the same judge.
Method 1 — Naive RAG (the baseline)
The simplest possible design, and the one every other method has to beat. Split the document into fixed-size chunks, embed each into a dense vector, store them. At query time, embed the question, retrieve the top-k chunks by cosine similarity, and pass them to the generator.

emb = embedder.encode([question]) # bge-m3
hits = collection.query(emb, n_results=10) # ChromaDB, cosine top-k
answer = llm.generate(RAG_PROMPT, context=hits) # gemma4, context-only prompt
Tables are kept whole as atomic chunks, so a retrieved table arrives intact rather than split across chunk boundaries. There is nothing clever beyond that, which is the point: it isolates how far plain dense similarity gets you.
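One way to get that behavior from Markdown input is to treat any block made up entirely of pipe-delimited lines as a table and never merge or split it. The sketch below assumes blocks are separated by blank lines and uses a character cap as the size limit; `chunk_markdown` and the `max_chars` default are illustrative choices, not the chunker used for the results in this article.

```python
def chunk_markdown(markdown: str, max_chars: int = 800) -> list[str]:
    """Size-capped chunking that keeps Markdown tables as atomic chunks.

    Note: a single prose block longer than max_chars is not split further
    in this sketch.
    """
    chunks, buffer = [], ""
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        is_table = all(line.lstrip().startswith("|") for line in block.splitlines())
        if is_table:
            if buffer:  # flush pending prose so the table stands alone
                chunks.append(buffer)
                buffer = ""
            chunks.append(block)  # kept whole, even if longer than max_chars
        elif buffer and len(buffer) + len(block) + 2 > max_chars:
            chunks.append(buffer)
            buffer = block
        else:
            buffer = f"{buffer}\n\n{block}" if buffer else block
    if buffer:
        chunks.append(buffer)
    return chunks


doc = "Intro text.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\nClosing text."
for chunk in chunk_markdown(doc, max_chars=10):
    print(repr(chunk))
```

Even with a deliberately tiny cap, the three-line table comes out as a single chunk, so a retrieved table always carries its header row with its values.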

Faithfulness is reasonable (it stays within what it retrieved), but answer correctness is the lowest of the three text methods. Dense similarity reliably finds chunks that are about the right topic, but on a financial deck the answer is often a specific number, and “semantically near the question” is not the same as “contains the exact figure the question asks for.”
Method 2 — Hybrid RAG
Dense embeddings capture meaning but can miss exact terms — a specific line item, an instrument name, a precise percentage. Lexical retrieval (BM25) [6] is the opposite: it matches exact tokens and ignores meaning. Hybrid retrieval runs both — plus bge-m3's sparse vectors — and merges the three result lists with Reciprocal Rank Fusion (RRF) [7], which combines candidates by rank position rather than raw score.

dense = dense_index.search(question, top_k) # bge-m3
bm25 = bm25_index.search(question, top_k) # lexical
sparse = sparse_index.search(question, top_k) # bge-m3 sparse
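from collections import defaultdict
def rrf(result_lists, k=60):
    """Fuse ranked lists of doc IDs; each list adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for lst in result_lists:
        for rank, doc_id in enumerate(lst, start=1):  # ranks start at 1, as in [7]
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)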
ranked = rrf([dense, bm25, sparse])
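A small worked example shows what rank-based fusion rewards. The document IDs and the three rankings below are invented, and `rrf_scores` is a standalone helper written for this illustration (ranks start at 1, k = 60); it is a sketch of the technique, not the pipeline's code.

```python
from collections import defaultdict


def rrf_scores(result_lists, k=60):
    """Return the fused RRF score per doc ID (ranks start at 1)."""
    scores = defaultdict(float)
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


# Hypothetical rankings from three retrievers.
dense = ["a", "b", "c"]
bm25 = ["b", "d", "a"]
sparse = ["b", "a", "e"]

scores = rrf_scores([dense, bm25, sparse])
fused = sorted(scores, key=scores.get, reverse=True)
print(fused[:3])  # ['b', 'a', 'd']
```

Document `b` wins because two retrievers rank it first and the third ranks it second, while `a` is first in only one list. Documents seen by a single retriever (`c`, `d`, `e`) still enter the pool, just with low scores, which is how fusion widens the candidate set.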

Answer relevancy and correctness both improve over Naive RAG, because the lexical signal surfaces chunks that contain the exact terms a financial question hinges on. Notice, though, that the two context metrics dip slightly: fusing three retrievers widens the candidate pool, which lifts the quality of the final answer while also pulling in some off-target chunks. That gap between “better answers” and “noisier context” is worth holding onto — it returns in Part 3.
Method 3 — Hybrid RAG + ColBERT reranking
The previous two methods retrieve and answer in one shot. This one adds a reranking stage. Hybrid retrieval produces a candidate pool, and a ColBERT-style late-interaction reranker [8] then rescores every candidate against the query before the generator sees anything.
The difference is in how relevance is computed. A normal dense retriever compresses a whole chunk into one vector and compares it to one query vector. Late interaction keeps a vector per token and compares every query token to every document token, taking the best match for each (MaxSim). That captures fine-grained relevance (a chunk that answers one precise part of the question) that a single pooled vector smooths away.
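The MaxSim idea fits in a few lines. The sketch below uses hand-made two-dimensional "token embeddings" so the arithmetic is easy to follow; real ColBERT vectors are learned, higher-dimensional, and normalized, and the function names here are illustrative rather than taken from any library.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: for each query token take its best-matching
    document token, then sum those maxima over the query."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def mean_pool(vecs):
    """Collapse token vectors into one vector, as a pooled retriever would."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]


# Toy vectors: the query has two distinct "aspects".
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # one token matches each aspect exactly
doc_b = [[0.7, 0.7]]              # a single token vaguely near both aspects

print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
print(dot(mean_pool(query), mean_pool(doc_a)),
      dot(mean_pool(query), mean_pool(doc_b)))
```

With these toy numbers, MaxSim prefers `doc_a` (2.0 against roughly 1.4), while the pooled single-vector comparison prefers `doc_b` (0.7 against 0.5). That reversal is the fine-grained signal pooling averages out.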

candidates = hybrid_retrieve(question, k=20)
reranked = colbert.rerank(question, candidates) # late interaction, MaxSim
context = reranked[:top_k]
answer = llm.generate(RAG_PROMPT, context=context)

This is the strongest text pipeline. Reranking pushes the most answer-bearing chunk to the top, and both faithfulness (0.887) and answer correctness (0.544) lead the group. Note that this happens despite the lowest context precision and recall of the three — a strong hint that those context metrics, judged against the extracted text, do not fully capture what makes an answer correct. I come back to exactly this point in Part 3.
Where the text methods land

Within text retrieval, more sophistication clearly pays: adding a lexical signal and then a token-level reranker moves answer correctness from 0.415 to 0.544 on the same document, questions, and judge.
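For scale, the size of that gain can be computed directly from the two figures quoted above; nothing else is assumed beyond the variable names.

```python
naive_correctness = 0.415     # Naive RAG, lowest of the three text methods
reranked_correctness = 0.544  # Hybrid RAG + ColBERT reranking

absolute_gain = reranked_correctness - naive_correctness
relative_gain = absolute_gain / naive_correctness

print(f"absolute: +{absolute_gain:.3f}, relative: +{relative_gain:.1%}")
# absolute: +0.129, relative: +31.1%
```

That is roughly a 31% relative improvement, with the document, questions, and judge held fixed.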
But all three top out well below what the document actually contains. A large part of this deck lives in tables and charts, and, as the extraction section already hinted, charts in particular do not survive the trip through text at all. In Part 2, I leave single-shot text retrieval behind and bring in three methods built for structure and for pixels: Graph RAG, ColPali visual retrieval, and a vision pipeline that answers straight from the page images.
References
[1] P. Lewis et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in Neural Information Processing Systems (NeurIPS), 2020. arXiv:2005.11401.
[2] C. Auer et al., “Docling technical report,” arXiv:2408.09869, 2024.
[3] A. Nassar, N. Livathinos, M. Lysak, and P. Staar, “TableFormer: Table structure understanding with transformers,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2022. arXiv:2203.01017.
[4] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “RAGAS: Automated evaluation of retrieval augmented generation,” in Proc. 18th Conf. European Chapter of the ACL (EACL): System Demonstrations, 2024. arXiv:2309.15217.
[5] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu, “BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation,” arXiv:2402.03216, 2024.
[6] S. Robertson and H. Zaragoza, “The probabilistic relevance framework: BM25 and beyond,” Foundations and Trends in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009.
[7] G. V. Cormack, C. L. A. Clarke, and S. Büttcher, “Reciprocal rank fusion outperforms Condorcet and individual rank learning methods,” in Proc. 32nd Int. ACM SIGIR Conf., 2009, pp. 758–759.
[8] O. Khattab and M. Zaharia, “ColBERT: Efficient and effective passage search via contextualized late interaction over BERT,” in Proc. 43rd Int. ACM SIGIR Conf., 2020. arXiv:2004.12832.