companion in Enterprise Document Intelligence, the series that constructs an enterprise RAG system from four foundational components. Article 5 (document parsing) built the parser using PyMuPDF (fitz), which extracts the words from a page. This companion replaces the engine with a vision LLM that interprets the page as an image, delivering the words plus the one capability text-based parsers lack: the content of the visuals.
Present a PDF parser with a chart and it perceives an empty void. Whether native, cloud-hosted, or local, all text engines locate the words on a page and store them in searchable tables. A chart contains no words, so to every one of them that region is blank, and to a retrieval system it might as well not exist.
A vision model works differently. It examines the page the way a human would. Request the text and it returns the text and the tables, just like the rest. Show it a chart and it describes what the chart conveys, in plain language you can search. That final capability is what the others cannot offer.
The tradeoff: it runs slower, costs more, and extracts numbers from charts only approximately. It is also only as capable as the model you select. gpt-4.1 reads a chart that the less expensive gpt-4o-mini partially misses. So you don’t deploy it universally. You reserve it for pages that are predominantly images, where the other parsers return nothing useful.
1. The one thing only a vision model can do: make an image searchable
Begin with the reason this parser exists in the first place. The text-based engines convert a page into the relational tables from the earlier articles, but a figure defeats them: they return a chart as a bounding box in image_df with perhaps a stray axis label. A chart has no text, so to OCR and to a layout model that region is empty, and to a retrieval system it might as well not exist.

A vision model interprets the picture. Below are three figures extracted directly from two PDFs: the Transformer diagrams from Attention Is All You Need (Vaswani et al. 2017) and the commodity-price charts from the World Bank Commodity Markets Outlook (April 2026 issue). Each figure sits beside the one-sentence description gpt-4.1 generated for it. Source documents and licensing details are provided at the end of the article.

The price chart is now a sentence: commodity price indices by sector, declining since their 2022 peak. A user searching for “commodity price index since 2022” can now land on that page. Before, there was nothing on it to match against.
Here is the argument at its most pointed. Imagine a satellite photograph of a parking lot. It contains no text whatsoever. OCR finds nothing, layout finds one box, and to a retrieval system the image does not exist. A vision model writes “aerial view of a parking lot, roughly half full, around forty cars.” Now a search for parking occupancy finds it. That sentence is the parse, and only a vision model can produce it. OCR and layout cannot, by definition, because there were never any characters to read.
2. It also parses text and tables, like the others
The figure is the distinctive part, but a parser that only read pictures would be useless. A vision model reads the text and the tables too, and not worse than the text-based engines on clean material. We directed parse_page_vision at page 30 of the NIST Cybersecurity Framework, the Framework Core table, and requested markdown. It returned the table columns intact, merged cells handled (the Function name sits on the first row of its block and the continuation rows leave it blank).

This is the same cell structure Docling and Azure produce from the same page in the two previous articles: they emit markdown tables too, so the format is not what sets vision apart. The vision model never built a table object; it read the grid off the picture and wrote markdown (it returns HTML just as well). So the claim from the lead holds: it is a parser, returning the reusable model the others return, plus the figures they cannot.
3. The model matters: gpt-4o-mini misses charts that gpt-4.1 reads
How good the parse is depends heavily on the model, and the gap shows precisely where it counts: on the figures. We ran the same CMO chart page through gpt-4o-mini and gpt-4.1.

gpt-4o-mini found three of the six charts and labeled two of them as tables. gpt-4.1 found all six and transcribed their axes down to the month, including the policy-uncertainty and temperature-anomaly charts the smaller model missed. Both read the page text and the NIST table correctly. The weaker model fell down on the pictures, the one thing you brought vision in to do. So with this parser the model is part of the quality, not just a latency and cost knob: a cheaper vision model degrades gracefully on text and badly on figures.
4. The honest trade: exactness and cost
None of this is free, and the tradeoff is worth naming plainly. It is not that vision “isn’t really parsing,” because it is. It is that the parse is less exact and costs more per page.

Two costs stand out.
Exactness, with two faces: The values it reads off
Among all parsers covered in this series, vision-based parsing is the gentlest. This is its great advantage. A text-only engine fails silently when the PDF has little or no extractable text. The vision engine never evades the issue — it reads every page the same way, picture by picture, and either produces something useful or says clearly it found nothing. It does not pretend there is text when there is none.
Still, soft does not mean precise. The numbers a vision model reads from a curve chart are only approximate: the overall shape and general trend are reliable, but any individual data point may be slightly off. So treat every transcribed figure as a clue worth double-checking, not as confirmed truth. Worse still, a vision model can quietly leave something out — skip an entire row in a table, drop one chart from a panel of several — the way `gpt-4o-mini` dropped half the charts in section 3. That is a completeness problem, a type of hallucination by omission, and a deterministic parser never suffers from it: when Fitz or Docling reads a table, no row goes missing.
**Cost:** Each page is processed as a full image, triggering one model call, billed per page, with no bounding boxes provided afterward. Text-based parsers run once, cost almost nothing per page, and return exact character spans.
So the guiding principle is not “use vision instead of parsing.” It is “use vision only for the pages where text-based parsers come up empty.”
—
## 5. How it works: `parse_page_vision`
The mechanism is compact. The function renders the entire page as an image, sends that image to the vision model through the same `responses.parse` structured-output call used elsewhere in the series, and returns a small object: the page converted to markdown, plus a list of figures, each tagged with a `kind`, a `description`, and a `transcription`.
“`python
page = parse_page_vision(“CMO-April-2026.pdf”, 10, model=”gpt-4.1″)
page.markdown # headings, paragraphs, tables
page.figures # one entry per chart / diagram
page.figures[0].description # “line chart, price index …”
page.figures[0].transcription # axes, legend, readable values
“`
`parse_page_vision` sits alongside the `fitz`, `azure_layout`, and `docling` parsers because it *is* a parser. The adaptive-parsing dispatcher (Article 10) calls it whenever a page is visual enough that the text-based engines return empty-handed.
The body is short enough to read in one sitting. Two Pydantic models define the expected output: the page as markdown, plus one entry per figure with its kind, description, and transcription. The function renders the page to an image, attaches the prompt, and makes a single structured call via the shared `llm_parse` wrapper. Retries, token limits, and call caching are handled by the wrapper. There is no separate layout model and no OCR step — the model reads the pixels directly and populates the schema.
“`python
class FigureContent(BaseModel):
kind: str # chart, diagram, photo, map, …
description: str # what it shows, in searchable words
transcription: str # axes, legend, readable values
class VisionPageParse(BaseModel):
markdown: str # the page as markdown, tables kept
figures: list[FigureContent] # one entry per figure on the page
def parse_page_vision(pdf_path, page, *, client=None, model=None, zoom=2.0):
client = client or get_vision_client()
model = model or vision_model()
page_image = render_page_data_url(pdf_path, page, zoom=zoom)
content = [{“type”: “input_text”, “text”: “Parse this page.”},
{“type”: “input_image”, “image_url”: page_image}]
return llm_parse(
input=[{“role”: “system”, “content”: VISION_PARSE_SYSTEM_PROMPT},
{“role”: “user”, “content”: content}],
text_format=VisionPageParse, # the Pydantic contract above
client=client, model=model, label=”vision.parse_page”,
)
“`
The system prompt (`VISION_PARSE_SYSTEM_PROMPT`) is the other half of the engine: it instruct the model to preserve headings and reading order, output every table in markdown format, and add one entry per figure whose description is worded so someone could later *search* for it. Change that prompt and you reshape the parser itself.
—
## 6. The lighter mode: ask the page directly
There is an alternative, more lightweight use of the same capability. Instead of converting the whole page into a reusable structured object, just hand the model the page image together with a single question and get back a single answer. No markdown, no index, nothing stored. Hauling out the full parser would be overkill in these situations.
“`python
ans = answer_from_pdf_vision(
“data/nist/NIST.CSWP.04162018.pdf”,
“Category Unique Identifier for ‘Asset Management’?”,
pages=30,
)
ans.answer # “ID.AM”
ans.answer_found # True (False when not on the page)
“`
The behaviour is consistent, and the choice of model hardly matters here: both `gpt-4o-mini` and `gpt-4.1` produce the same answers. The NIST Framework Core lookup returned `ID.AM`, Function `Identify`; a question about Figure 1 of the Attention paper, legible from the diagram, came back correctly; and a question whose answer was absent from the page returned nothing.
That last row is just as important as the first two. A model inclined to read a page will fabricate a plausible-sounding answer unless the schema and the prompt explicitly give it a way to respond “not found.” Making that null-return path reliable makes the mode safe to use.
**Same idea, packaged.** The vision-as-parser pattern is now offered as an integrated product by several vendors. Mistral Document AI on Azure AI Foundry (model `mistral-document-ai-2512`, available as a serverless API in East US / East US 2 / Sweden Central) bundles an OCR component (`mistral-ocr-2512`) with a small reasoning model (`mistral-small-2506`) and returns markdown alongside a JSON object whose schema you can customise. The output contract differs from `parse_page_vision` — markdown instead of a `line_df`, structured extraction baked into the same call rather than deferred to a generation step. Same core idea, packaged for a per-page billing model. For pipelines that already operate in markdown or want the layout-plus-extraction step collapsed into a single API call, it is worth benchmarking against the OpenAI vision approach used in this article.
**The bbox gap is real.** Mistral OCR returns bounding boxes **only for images** embedded in the page (each image carries `top_left_x` / `top_left_y` / `bottom_right_x` / `bottom_right_y`). The markdown body itself has **no per-line, per-paragraph, or per-table-cell bboxes**. That breaks two things the rest of the series depends on: Article 1’s PDF annotation step (highlighting cited lines on the source PDF requires bboxes) and Article 7’s line-level retrieval audit (every retrieved row points back to its bbox so the reader can verify against the page).
**An open question for the reader, then.** How would you reconcile two parsers running on the same page — Mistral’s markdown
How do you merge two different parsing outputs—one from a structured but bbox-less source and another from a Docling-style line_df that includes bounding boxes but is flatter—into a single, unified result your downstream systems can use? Aligning two text streams at the line or token level is a notoriously difficult problem: segmentation strategies differ, OCR errors vary, and markdown’s table flattening discards cell positioning. This article does not offer a solution. If your downstream pipeline requires bbox-level traceability, the effort to reconcile these formats is significant and should be measured before committing to the markdown-based contract.
Sources for this section:
- Mistral OCR API endpoint specification, including input schema and response schema (pages with
markdown+imagesarray, where only image bounding boxes are provided). - Mistral OCR processor documentation (basic OCR),
table_formatparameter, and per-page response structure. - Mistral Document Annotations documentation, covering optional structured extraction with custom schemas and bbox-level annotations when explicitly requested.
- Mistral Document AI 2512 on Azure AI Foundry catalog, with serverless deployment available in East US / East US 2 / Sweden Central and billing calculated per page.
- Unlocking Document Understanding with Mistral Document AI in Microsoft Foundry, describing the bundled
mistral-ocr-2512+mistral-small-2506composition.
7. Four parsers now, one of them reads the pictures
All four engines are parsers. Three focus on text and structure; the fourth handles those as well—and also interprets the images embedded in the document.

Article 10 (adaptive parsing) introduces the dispatcher that selects the most suitable parser for each page. The vision parser occupies the visual end of the spectrum: use it when a page is dominated by charts, when a diagram contains the key information, when a scan is too degraded for reliable OCR, or when the content is purely an image with no text. It is the most costly per page and the least precise with numerical data, so it is invoked last. However, it is the only engine capable of transforming an image into retrievable content.
8. Conclusion
A vision model functions as a parser: request markdown, and it returns text and tables just like fitz or Azure; ask it to describe figures, and it delivers the one capability textual parsers lack—searchable descriptions of images. The trade-offs are real (lower precision, no bounding boxes, one model call per page), so the vision parser does not replace the textual ones—it fills their blind spot. The textual parsers read the words on the page; the vision parser reads the page that has no words.
9. Sources and further reading
Vision-language models used as document parsers stem from two lineages: the open VLM research community (PaliGemma, Florence-2, Qwen-VL family) and the leading multimodal APIs (OpenAI GPT-4o / GPT-4.1, Anthropic Claude with vision, Google Gemini). The most relevant references for this article are ColPali (Faysse et al. 2024), which treats the visual page image as the core retrieval unit, and the official model documentation pages where OpenAI details the vision capabilities of gpt-4.1 and gpt-4o-mini.
Same direction as the article:
- OpenAI, Vision capabilities of the gpt-4.1 family. Reference documentation for the model behind
parse_page_vision; follows the same architectural pattern (vision LLM as a parser that returns markdown or structured output). - Faysse et al., ColPali: Efficient Document Retrieval with Vision Language Models, 2024 (arXiv:2407.01449). Vision-language retrieval applied directly to page images. Anchors the visual row of the Article 4 diagnostic grid; applies the same family of techniques to a different component (retrieval rather than parsing).
Different angle, different context:
- Auer et al., Docling Technical Report, IBM Research 2024 (arXiv:2408.09869). Layout-driven parsing without a generative model. Offers a different cost-quality tradeoff: deterministic, low-cost, but unable to interpret figures. Article 5ter (Docling parsing) covers this engine in full.
- Microsoft, Azure AI Document Intelligence. Cloud-based parser with cell-level output. Shares Docling’s limitation with figures, but complements the vision LLM on every other content type.
Source documents and licensing. The figures and tables in this article are reproduced from openly-licensed sources:
Earlier in the series:
- Document Intelligence: series intro. An overview of what the series builds, component by component, and in what sequence.
- Baseline Enterprise RAG, from PDF to highlighted answer. The complete four-component pipeline end to end: PDF in, highlighted answer out.
- Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity excels (synonyms, typos, paraphrase), where it predictably fails (unknown terms, negation, term-vs-answer relevance), and how to work with it effectively.
- Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost. What a cross-encoder adds beyond bi-encoder embeddings, measured empirically, and when the added latency is justified.
- RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size sweeps and fine-tuning optimize the wrong objective; route by question type instead.
- From regex to vision models: which RAG technique fits which problem. Two axes—document complexity and question control—that guide technique selection for each scenario.
- 10 common RAG mistakes we keep seeing in production. Ten recurring production issues, organized component by component, with the fix for each.
- Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing component: the document’s nature, signals, and summary.
- Stop returning flat text from a PDF: the relational shape RAG needs. The second half of the parsing component: the relational tables every downstream component relies on.
- When PyMuPDF can’t see the table: parse PDFs for RAG with Azure Layout (link to come). The same tables extracted via Azure Layout: native table cells, OCR, and paragraph roles.
- Parse PDFs for RAG locally with Docling: rich tables, no cloud upload (link to come). The same tables generated locally with Docling: TableFormer cells, with nothing leaving the machine.



