Benchmarking RAG Architectures Locally on a Real Financial PDF — Part 2: Escaping the Text Layer
Last Updated on June 25, 2026 by Editorial Team
Author(s): Ali Enver Arslan
Originally published on Towards AI.
Part 2 of a three-part series. Part 1 covered the document, the extraction, the evaluation setup, and the text-retrieval methods.
In Part 1, three text-retrieval pipelines climbed from 0.415 to 0.544 answer correctness on the same 57 questions. More sophisticated retrieval helped, but all three hit a ceiling, and it was the same ceiling for all of them: they read a text layer extracted from an image-only deck, and that layer loses the document’s charts entirely.
So the obvious next move is to stop accepting that ceiling. This article is three escalating attempts to get past it:
- Restructure the text into a knowledge graph (Graph RAG).
- Make retrieval visual — find the right page by sight, not by text (ColPali).
- Make generation visual — let the model read the page as an image (Vision RAG).
Same document, same questions, same local judge as Part 1. As a reminder, the metric I trust most is answer correctness, the only one that compares against the independent reference answer.
Method 4 — Graph RAG
Graph RAG [1] attacks the problem by reorganizing the text rather than re-reading it. An LLM reads each chunk and extracts the entities and the relations between them. Those become a knowledge graph; nearby nodes are grouped into communities, and each community gets a short summary. At query time, the pipeline finds the entities mentioned in the question, walks the graph to the chunks around them, adds the relevant community summaries, and generates from all of it.
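The query-time walk described above can be sketched with plain dictionaries. This is a minimal illustration, not the pipeline used in the study: the toy graph, the entity names, the `retrieve_chunks` helper, and the substring-based entity matching are all assumptions made for the example. A real system would use an LLM for extraction and a community-detection step for the summaries.

```python
from collections import deque

# Hypothetical toy graph: entity -> neighbouring entities (from extracted relations)
GRAPH = {
    "net interest margin": ["tl spread", "swap cost"],
    "tl spread": ["net interest margin"],
    "swap cost": ["net interest margin", "cpi linkers"],
    "cpi linkers": ["swap cost"],
}

# Hypothetical mapping: entity -> ids of the chunks it was extracted from
ENTITY_CHUNKS = {
    "net interest margin": [3],
    "tl spread": [3, 4],
    "swap cost": [7],
    "cpi linkers": [8],
}


def retrieve_chunks(question, max_hops=1):
    """Return chunk ids reachable within max_hops of entities named in the question."""
    q = question.lower()
    seeds = [e for e in GRAPH if e in q]  # naive substring entity matching
    seen = set(seeds)
    queue = deque((e, 0) for e in seeds)
    chunks = set()
    while queue:
        entity, depth = queue.popleft()
        chunks.update(ENTITY_CHUNKS.get(entity, []))
        if depth < max_hops:
            for neighbour in GRAPH.get(entity, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, depth + 1))
    return sorted(chunks)


print(retrieve_chunks("What drove net interest margin?"))  # [3, 4, 7]
```

A question that names no known entity returns an empty list, which is one way a graph pipeline ends up with nothing to generate from. Community summaries, the other ingredient of Graph RAG, are left out of this sketch.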

It is built for relational, multi-hop questions — “which exposures connect this subsidiary to that risk factor” — where the answer is assembled by following links. Worth saying plainly up front: a quarterly earnings deck rarely asks those. Most of its questions are direct lookups of a number on a page. That mismatch shows up in the scores.

This is the weakest pipeline in the study, and the mismatch is structural. The deck’s numbers sit in dense financial tables with little narrative to connect them. Entity-and-relation extraction handles the deck’s plain text well, picking up figures stated in sentences (USD 238 mn, TL 681 bn, and the like), but the tables give graph traversal little it can act on. The upshot is the lowest answer correctness in the study, from a pipeline that, lacking the figure in its context, often answers "that figure isn't in the provided context."
Which is exactly why it posts the highest faithfulness in the study (0.902) while its correctness is the lowest. A pipeline that honestly declines to answer is, technically, almost perfectly faithful. It never claims anything its context doesn’t support. Hold that thought: faithfulness can reward caution as much as accuracy. It comes back in Part 3.
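To see why declining to answer can score so well, here is a toy claim-level faithfulness score. It assumes faithfulness is the fraction of an answer's claims that the context supports, which is how RAGAS-style faithfulness is commonly defined. Whether the judge in this study works exactly this way is an assumption, and the verdict lists below are invented for illustration.

```python
def toy_faithfulness(verdicts):
    """Fraction of claims judged as supported by the context (True = supported)."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# A refusal makes one claim, and the context supports it: the figure really is absent.
refusal = [True]

# A hypothetical answer with one supported claim and two clauses the context cannot back.
chatty = [True, False, False]

print(toy_faithfulness(refusal))           # 1.0
print(round(toy_faithfulness(chatty), 3))  # 0.333
```

Under this definition, saying less is the cheapest way to score high: the refusal is fully "faithful" while answering nothing.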
Visual retrieval: two ways to use it
Graph RAG still reads the lossy text layer; it just rearranges it first. The next idea drops text from the retrieval step altogether: retrieve pages by sight.
ColPali [2] is a vision-language retrieval model. Instead of embedding extracted text, it renders each page to an image and encodes it as many small patch vectors. A query is scored against a page with MaxSim [3]: every query token is matched to its best-fitting patch, and the scores are summed. This is late interaction at the level of the page: a question about “net fee and commission income” lands on the region of the page that visually holds that line, even when OCR mangled the same table into noise. The model is matching words to places on the page, not words to words, and no text extraction is involved in finding the page at all.
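MaxSim itself is only a few lines. The sketch below uses NumPy and assumes the query and each page are already encoded as matrices of vectors, one row per query token or page patch. The `maxsim_topk` name mirrors the helper in this article's pseudocode, but the body is an illustrative guess at its shape, not the author's implementation.

```python
import numpy as np


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: each query token takes its best patch, then sum."""
    sims = query_vecs @ page_vecs.T  # shape: (n_query_tokens, n_patches)
    return float(sims.max(axis=1).sum())


def maxsim_topk(query_vecs, pages, k=5):
    """Rank pages (a list of patch matrices) by MaxSim and return the top-k indices."""
    scores = [maxsim_score(query_vecs, p) for p in pages]
    order = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy data: two query tokens, two pages with two patches each
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page_a = np.array([[1.0, 0.0], [0.0, 1.0]])  # has a matching patch for both tokens
page_b = np.array([[1.0, 0.0], [1.0, 0.0]])  # matches only the first token

print(maxsim_score(query, page_a))  # 2.0
print(maxsim_score(query, page_b))  # 1.0
print(maxsim_topk(query, [page_b, page_a], k=1))  # [1]
```

Because every query token picks its own best patch, a page wins by covering all the terms of the question somewhere on the page, not by matching one averaged vector.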
Once retrieval is visual, there are two choices for what the generator reads, and that fork is the whole point of this section.

Both branches retrieve the same pages the same way. They differ only in the input the generator sees: the extracted text, or the page image.
Method 5 — ColPali RAG (visual retrieval, text generation)
The first branch: retrieve pages with ColPali, but generate from the extracted text of those pages, with gemma4.
# index: each page rendered to an image, encoded to multi-vector patches
page_vectors = colpali.encode_images(page_images) # vidore/colpali
# query: score the question against every page, MaxSim over patches
q = colpali.encode_query(question)
pages = maxsim_topk(q, page_vectors, k=5) # whole pages, by sight
context = [page_text[p] for p in pages] # then read the text
answer = gemma4.generate(RAG_PROMPT, context=context)

Reading pages as images, ColPali “sees” layout and finds the right page more cleanly than any text retriever. But look at correctness: 0.528, essentially tied with the best text pipeline from Part 1. The retrieval got better; the generation didn’t, because the generator is still reading the same lossy text. Good retrieval, lossy generation. That gap is the argument for the next branch.
Method 6 — Vision RAG (visual retrieval, visual generation)
The second branch changes the one thing ColPali left untouched: it lets the generator read the image, not the text.
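q = colpali.encode_query(question) # same query encoding as Method 5
pages = maxsim_topk(q, page_vectors, k=5) # same retrieval as ColPali
images = [page_image[p] for p in pages]
# the generator reads the page image, not the extracted text
# (only the top-ranked page is passed in this snippet)
answer = vlm.generate(VISION_PROMPT, image=images[0]) # qwen3-vl:8b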
Two things mattered for making this work, and both are about the generator, not the retrieval. First, the model: I screened three local vision-language models — llama3.2-vision, gemma4, and qwen3-vl:8b — and qwen3-vl [4] read fine print most reliably. The differences were not subtle. Weaker vision models tend to narrate: they describe the chart, add caveats, and occasionally derive a number instead of reading it. Every extra clause is a claim that can be wrong, and on a strict reading most of them are unsupported, so a chatty-but-correct answer still scores badly.
Which is why the second lever — the prompt — matters as much as the model. A strict instruction keeps the model doing the one thing it is good at, reading:
Answer with a single fact. No preamble, no explanation.
For a change or comparison, report BOTH period values; do NOT compute the difference.
If the value is not in the image, reply exactly: NOT_IN_IMAGE
The middle rule earns its place. Asked how a figure moved between two quarters, a vision model will happily subtract and report a delta, and one small arithmetic slip turns a correct read into a wrong answer. Telling it to report both printed values, and nothing else, keeps the arithmetic out of it.
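A strict prompt also makes replies easy to handle in code, because an abstention is an exact string, not a paragraph of hedging. Here is a minimal sketch, assuming the reply arrives as a plain string. The `parse_reply` helper and its return convention are hypothetical, and the full prompt used in the study may contain more than the three rules quoted above.

```python
VISION_PROMPT = (
    "Answer with a single fact. No preamble, no explanation.\n"
    "For a change or comparison, report BOTH period values; "
    "do NOT compute the difference.\n"
    "If the value is not in the image, reply exactly: NOT_IN_IMAGE"
)

SENTINEL = "NOT_IN_IMAGE"


def parse_reply(reply):
    """Return None when the model abstains, otherwise the trimmed answer text."""
    text = reply.strip()
    if text == SENTINEL:
        return None
    return text


print(parse_reply("  NOT_IN_IMAGE\n"))  # None
print(parse_reply("310"))               # 310
```

An exact-match sentinel keeps abstentions countable, so they can be reported separately from wrong answers.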
With qwen3-vl:8b and that prompt, Vision RAG becomes the most accurate pipeline in the study. I ran a second variant with gemma4 as the vision model — same retrieval, same prompt, weaker vision — as a controlled comparison:

Note that the context recall (0.596) is identical to ColPali’s: ColPali RAG and both Vision RAG variants retrieve the same pages, so the retrieval-side metrics should match, and recall does. (Context precision wobbles a little, 0.525 for ColPali and 0.563 for Vision RAG, but with identical retrieved pages that gap is just noise from the LLM judge, not a real difference; it is an early hint that these context metrics are shakier than they look, which Part 3 takes up in full.) So the jump in correctness, from ColPali’s 0.528 to Vision RAG’s 0.823, is purely a generation effect. Nothing about retrieval changed.
Why such a jump? Because the text layer never had the answer in the first place. Here is page 11 of the deck — two charts dense with numbers:

And here is what extraction recovers from it:
## Margin recovery set to continue
<!-- image -->
## NIM recovery restarted in 3Q25 and continued in 4Q25
- Margin recovery underpinned by positive trajectory in TL spread...
The text bullets survive. Both charts — the NIM waterfall (250, +194, … 310) and the quarterly swap-cost and CPI-linker bars (15,872, −4,998, … −13,608) — collapse into bare <!-- image --> placeholders. Every number in them is gone. So when a question asks for 4Q25 NIM, a text pipeline has the surrounding sentences but not the value; the vision model reads 310 straight off the bar. That is the whole finding in miniature, and Part 3 quantifies it by question type.
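A quick way to find pages where this has happened is to count the placeholders the extractor leaves behind. The sketch below assumes the extracted text is available per page as a markdown string and that lost figures appear as `<!-- image -->` comments, as in the excerpt above. The helper names and the `pages` dictionary are hypothetical.

```python
import re

# Matches the placeholder comment, tolerating extra whitespace inside it
IMAGE_PLACEHOLDER = re.compile(r"<!--\s*image\s*-->")


def count_lost_figures(page_markdown):
    """Count image placeholders left behind by extraction on one page."""
    return len(IMAGE_PLACEHOLDER.findall(page_markdown))


def pages_with_lost_figures(pages):
    """Map page number -> placeholder count, for pages that have at least one."""
    counts = {n: count_lost_figures(md) for n, md in pages.items()}
    return {n: c for n, c in counts.items() if c > 0}


sample = "## Margin recovery set to continue<!-- image -->## NIM recovery restarted"
print(count_lost_figures(sample))  # 1
print(pages_with_lost_figures({11: sample, 12: "plain text only"}))  # {11: 1}
```

Pages flagged this way are the ones where a text-only generator is most likely to have the surrounding sentences but not the value.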
One more thing to carry forward, and it is the opposite of what Graph RAG showed. Vision RAG has the highest correctness but low faithfulness (0.538), among the lowest in the study. Graph RAG was faithful by saying almost nothing; Vision RAG looks unfaithful while being right. These cannot both be straightforward readings of answer quality. Hold both thoughts.
Where this leaves us
Restructuring the text didn’t help (Graph RAG, 0.238): you can’t reorganize numbers the text layer never captured. Making retrieval visual found better pages but couldn’t fix the generation (ColPali, 0.528). Making generation visual broke the ceiling: Vision RAG reads the chart itself and reaches 0.823, well past the best text method’s 0.544.
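The gaps are easier to compare as differences against the best text method. The snippet below only redoes the arithmetic on the answer-correctness scores quoted in this article. The dictionary labels and the `gain_over` helper are illustrative names, not part of the study's code.

```python
# Answer-correctness scores quoted in this article
SCORES = {
    "Best text method (Part 1)": 0.544,
    "Graph RAG": 0.238,
    "ColPali RAG": 0.528,
    "Vision RAG": 0.823,
}


def gain_over(baseline, scores=SCORES):
    """Difference in answer correctness of each method versus the named baseline."""
    base = scores[baseline]
    return {
        name: round(score - base, 3)
        for name, score in scores.items()
        if name != baseline
    }


gains = gain_over("Best text method (Part 1)")
for name, delta in gains.items():
    print(f"{name}: {delta:+.3f}")
```

Run as written, this shows Graph RAG at -0.306, ColPali RAG at -0.016, and Vision RAG at +0.279 relative to the best text pipeline.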
But two of the numbers above don’t behave the way a single “quality” score should: Graph RAG is the most faithful pipeline and the least correct, while the most correct pipeline is one of the least faithful. In Part 3, I put all six methods side by side, break the results down by question type (text, table, and chart, the last being where the vision gap is widest), and show why those two faithfulness signals are pointing at the same thing: the standard metrics were not built for a pipeline that reads the page instead of the text.
References
[1] D. Edge et al., “From local to global: A graph RAG approach to query-focused summarization,” arXiv:2404.16130, 2024.
[2] M. Faysse et al., “ColPali: Efficient document retrieval with vision language models,” arXiv:2407.01449, 2024.
[3] O. Khattab and M. Zaharia, “ColBERT: Efficient and effective passage search via contextualized late interaction over BERT,” in Proc. 43rd Int. ACM SIGIR Conf., 2020. arXiv:2004.12832.
[4] P. Wang et al., “Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution,” arXiv:2409.12191, 2024.