This companion piece is part of the Enterprise Document Intelligence series, which constructs an enterprise RAG system from four foundational components. It builds on Article 5 (document parsing) by focusing on a single table: image_df, which pinpoints every image embedded in the PDF without actually interpreting any of them. In this installment, we assemble the reading toolkit — a cascade of methods arranged by cost (a low-cost filter, a type check, traditional OCR, a vision model) — that converts the small number of images worth investing in into searchable text.
The parsing component hands you image_df: one entry per image found in the PDF, recording its page number, bounding box, dimensions, and a content hash. That tells you where every image sits. It says nothing about what any of them depicts. For retrieval purposes, that is equivalent to not having them at all — a bounding box isn’t something a user can query, and the image’s text field, where a description would normally go, remains blank.
The instinctive reaction is to feed every image into a vision model and call it done. That is the wrong default. A typical document is saturated with images that hold nothing a user would ever search for: the corporate logo repeated in every page header, a horizontal rule rendered as a 2-pixel-tall graphic, a bullet-point glyph, a decorative banner. Generating captions for those with a vision LLM means paying a model to describe a logo three hundred times over.
So the problem divides into two parts. First, the techniques for converting an image into text, along with the cost each one incurs: a low-cost filter, a type check, traditional OCR, a vision model. Second, which images are genuinely worth the investment in a given processing run. That second question is driven by context. A sentence in the body text reading “Figure 3 below illustrates…” is the signal to interpret that specific figure with a vision model while skipping its neighbors; the user’s actual question narrows the selection even further. This article establishes the techniques and demonstrates what each one produces, ordered by cost. Deciding which images to invest in, per document and per query, is the subject of adaptive parsing, covered separately in Article 10. Here, we build the toolbox.

1. Most images don’t justify a model call
The opening step costs nothing. Before any OCR or vision-model invocation, an inexpensive filter examines signals already present in image_df along with a handful of pixel-level statistics, and discards images that carry no retrieval value:
- Too small. An image whose shortest dimension is only a few dozen pixels, or whose total pixel area falls below a modest minimum, is an icon or a bullet — not a meaningful figure. A size threshold eliminates the bulk of these.
- The wrong shape. An image that is extremely long and extremely narrow is a rule or a divider, not substantive content. An aspect-ratio check catches these.
- Repeated everywhere. The same content hash appearing across most pages of the document signals chrome — a header logo, a footer emblem, a watermark. Tracking how many pages an image hash occurs on flags it as decoration rather than information.
is_worth_analyzing enforces these size and shape rules on each image individually, while flag_worth_analyzing first counts how many pages each content hash appears on, then appends a worth_analyzing column to image_df. Both functions reside in docintel.parsing.pdf.images. The thresholds are intentionally permissive: a false keep results in one unnecessary model call later, whereas a false drop loses content irretrievably, so when uncertain the filter retains the image. Flat, contentless images that are too large to be caught by the size test (a solid-color panel, for instance) are not caught here; they are intercepted one step later, classified as decorative, and skipped in exactly the same way.
Input: image_df (plus per-image pixel statistics). Output: the same table augmented with a worth_analyzing flag.
On a typical report, this step alone eliminates the vast majority of images before a single model is invoked. What remains is the small set that actually carries meaning.
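The three rules above can be sketched in a few lines. This is a minimal illustration, not the code behind is_worth_analyzing or flag_worth_analyzing: the threshold values, the function names, and the use of plain dicts in place of image_df rows are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical thresholds, chosen only for illustration.
MIN_SIDE_PX = 32         # shortest side below this: icon or bullet
MIN_AREA_PX = 4_000      # total area below this: too small to matter
MAX_ASPECT_RATIO = 12.0  # longer side / shorter side above this: rule or divider
MAX_PAGE_SHARE = 0.5     # hash seen on more than this share of pages: chrome


def passes_size_and_shape(width, height):
    """Per-image test: keep unless the image is clearly tiny or a thin strip."""
    short, long_ = min(width, height), max(width, height)
    if short < MIN_SIDE_PX or width * height < MIN_AREA_PX:
        return False
    return long_ / short <= MAX_ASPECT_RATIO


def flag_images(images, n_pages):
    """Return copies of the image records with a worth_analyzing flag added."""
    # Count the distinct pages each content hash appears on.
    pages_by_hash = defaultdict(set)
    for img in images:
        pages_by_hash[img["hash"]].add(img["page_num"])

    flagged = []
    for img in images:
        repeated = len(pages_by_hash[img["hash"]]) / n_pages > MAX_PAGE_SHARE
        keep = passes_size_and_shape(img["width"], img["height"]) and not repeated
        flagged.append({**img, "worth_analyzing": keep})
    return flagged


# A four-page document: a header logo on every page, one rule, one real figure.
images = [
    {"page_num": p, "hash": "logo", "width": 120, "height": 60} for p in range(1, 5)
] + [
    {"page_num": 2, "hash": "rule", "width": 900, "height": 2},
    {"page_num": 3, "hash": "fig1", "width": 640, "height": 480},
]
flags = {
    (img["hash"], img["page_num"]): img["worth_analyzing"]
    for img in flag_images(images, n_pages=4)
}
```

Only the figure survives: the logo passes the size and shape test but fails on repetition, and the rule fails on its shortest side before the aspect ratio is even checked.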
2. What kind of image is it?
The images that pass through the filter are not all processed the same way. A screenshot of a table is text: traditional OCR reads it cheaply and accurately. A line chart is not text at all; its meaning resides in the axes and the trend, and only a vision model can articulate that in words. Sending the chart to OCR yields a handful of stray axis labels; sending the screenshot to a vision model incurs chart-level costs for something OCR handles for free.
So the second step assigns each retained image to a category:
- decorative: a blank or near-uniform panel. Skip it.
- text: a screenshot, a scanned region, a table rendered as an image. Process with OCR.
- chart / diagram / photo: the meaning is visual. Process with a vision model.
classify_image returns one ImageType derived from inexpensive pixel signals: how much the pixels vary, how saturated they are, how much of the image consists of near-white background, and how dense its edges are. A near-uniform panel is classified as decorative. The test for this deserves a closer look, because the obvious approach doesn’t work: you cannot detect a blank panel by counting its colors. A real “all-black” or “all-white” region is never pixel-perfect — sensor noise and JPEG compression produce hundreds of near-identical color values, so a color count sails right past it. What stays near zero on a blank panel, noise included, is the dispersion of the pixel values — their standard deviation. Low dispersion means blank, regardless of the color count, so that is the signal used. Black ink on a white page — near-zero saturation with genuine stroke structure — is classified as text. A saturated, full-bleed image with no white margins is a photo. Everything else, every ambiguous case, falls through to chart.
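To make the color-count argument concrete, here is a small simulation. It is a sketch under stated assumptions: the synthetic pixel data, the helper names, and the BLANK_STD_MAX threshold are invented for illustration and are not taken from classify_image.

```python
import random
from statistics import pstdev

random.seed(0)

# A "blank" dark panel with compression-style noise: each RGB channel jitters around 12.
noisy_blank = [
    tuple(12 + random.randint(-3, 3) for _ in range(3)) for _ in range(10_000)
]
# Black ink on white: roughly 10% of pixels are stroke, the rest background.
ink_on_white = [
    (0, 0, 0) if random.random() < 0.10 else (255, 255, 255) for _ in range(10_000)
]


def distinct_colors(pixels):
    """The naive signal: how many different RGB values occur."""
    return len(set(pixels))


def dispersion(pixels):
    """Population standard deviation over all channel values (0-255 scale)."""
    return pstdev(channel for pixel in pixels for channel in pixel)


BLANK_STD_MAX = 5.0  # hypothetical threshold

blank_is_blank = dispersion(noisy_blank) < BLANK_STD_MAX
text_is_blank = dispersion(ink_on_white) < BLANK_STD_MAX
```

The noisy panel holds hundreds of distinct colors while the text image holds two, so a color count would rank them the wrong way round. The standard deviation stays small for the panel and large for the text, which is the ordering the classifier needs.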
Notice what is absent from that list: a step that decides “this looks like a logo.” That omission is deliberate, and it follows the same logic as the blank-panel case. A logo can be two flat colors, a black wordmark on white, or a full-color gradient with soft edges. Counting colors catches the flat-color marks and misses the gradient; worse, a two-color test also catches a bilevel scan of real text you actually wanted to read. Appearance alone does not identify a logo. Behavior does: a logo is chrome because it repeats, the same mark appearing in every page header. That signal was already evaluated, back in the filter, which discards any image whose content hash recurs across most pages, regardless of how many colors it contains.
A logo that appears just once, a mark on a cover page, doesn’t warrant special treatment; it gets processed like everything else, a wordmark handed off to free OCR, a graphic routed to a single vision call. The guiding principle is consistent: only skip content you’re certain is empty or decorative chrome, and read everything else, because an incorrect skip silently discards meaningful content.
That fallback to chart is the other critical design decision. Trying to distinguish a chart from a diagram from a photo using only low-level signals isn’t dependable, so the classifier doesn’t attempt to be overly sophisticated: it only routes an image to cheap OCR when it’s confident the image is clean monochrome text, and sends everything else to the vision model, which interprets charts, diagrams, photos, and any text they happen to contain. The asymmetry is deliberate. A missed OCR shortcut costs one vision call; running OCR on a diagram produces a handful of stray axis labels and gibberish. So when uncertain, the classifier pays for vision. Classification itself remains cheap, no model call needed, because it must cost less than the analysis it’s meant to avoid.
In: an image that passed the filter. Out: its ImageType.
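The routing logic described in this section can be sketched as a short decision chain. Everything here is an assumption made for illustration: the ImageKind enum and PixelStats container are hypothetical stand-ins for the series' ImageType and its inputs, and the thresholds are not the ones classify_image uses.

```python
from dataclasses import dataclass
from enum import Enum


class ImageKind(Enum):  # hypothetical stand-in for the series' ImageType
    DECORATIVE = "decorative"
    TEXT = "text"
    CHART = "chart"
    PHOTO = "photo"


@dataclass
class PixelStats:
    std: float              # dispersion of pixel values, 0-255 scale
    mean_saturation: float  # 0.0 (grey) to 1.0 (vivid)
    white_fraction: float   # share of near-white pixels
    edge_density: float     # share of pixels lying on a strong edge


def classify(stats: PixelStats) -> ImageKind:
    # 1. Near-uniform panel: nothing to read.
    if stats.std < 5.0:
        return ImageKind.DECORATIVE
    # 2. Only a confident "clean monochrome text" verdict earns the cheap OCR path.
    if (
        stats.mean_saturation < 0.05
        and stats.white_fraction > 0.5
        and stats.edge_density > 0.02
    ):
        return ImageKind.TEXT
    # 3. Saturated and full-bleed: a photo.
    if stats.mean_saturation > 0.4 and stats.white_fraction < 0.05:
        return ImageKind.PHOTO
    # 4. Every ambiguous case falls through to the vision path.
    return ImageKind.CHART


blank = classify(PixelStats(std=2.0, mean_saturation=0.0, white_fraction=0.0, edge_density=0.0))
scan = classify(PixelStats(std=70.0, mean_saturation=0.01, white_fraction=0.85, edge_density=0.08))
photo = classify(PixelStats(std=60.0, mean_saturation=0.55, white_fraction=0.01, edge_density=0.10))
unsure = classify(PixelStats(std=40.0, mean_saturation=0.20, white_fraction=0.60, edge_density=0.05))
```

The order of the checks carries the asymmetry: the OCR branch has the narrowest condition, and the final return is the expensive but safe default.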
3. The cascade: the cheapest method that can read it
Type determines method. METHOD_BY_TYPE maps each type to one of three actions, ordered by cost, and describe_figure dispatches accordingly. The entire decision, for the cases you actually encounter in a document, fits in one table: what captures the image, what it costs, and what you get back.

Read it top to bottom and you’re reading the cascade in order. The first three rows never reach a model at all: the filter discards them based on size, shape, or repetition. The next row is caught by the classifier as a blank panel and skipped as well. Only the bottom five incur any cost, and of those only the genuine text-image takes the free path. The rest reach the vision model, which is precisely where your budget should be directed.
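As a rough picture of the dispatch, the mapping from type to method can be written as a plain dictionary. The article does not show the contents of METHOD_BY_TYPE, so the names and values below are hypothetical; the sketch only mirrors the routing rules stated in the text.

```python
# Hypothetical stand-in for METHOD_BY_TYPE; the real mapping is not shown in the article.
METHOD_BY_TYPE_SKETCH = {
    "decorative": "skip",  # blank panel: no call
    "text": "ocr",         # rendered text: local OCR
    "chart": "vision",     # visual meaning: vision model
    "diagram": "vision",
    "photo": "vision",
}


def choose_method(image_type, worth_analyzing=True):
    """Pick the cheapest method able to read the image."""
    if not worth_analyzing:
        # Dropped by the filter on size, shape, or repetition.
        return "skip"
    # An unrecognized type takes the vision path: when uncertain, pay for vision.
    return METHOD_BY_TYPE_SKETCH.get(image_type, "vision")


methods = {
    "filtered logo": choose_method("photo", worth_analyzing=False),
    "blank panel": choose_method("decorative"),
    "table screenshot": choose_method("text"),
    "line chart": choose_method("chart"),
}
```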
Watch out for sideways figures. A wide table or a landscape chart is often rotated 90 degrees to fit a portrait page. The rotation rarely appears where you’d first expect: the page’s rotation flag stays at 0, and the angle is embedded in the image’s own placement matrix instead. Rendered as-is, the figure reaches OCR or the vision model sideways, where OCR returns noise and a vision model interprets it with misplaced confidence and no indication it struggled. So the cascade reads the placement angle and counter-rotates the region before either method sees it: automatic, precise, no orientation-guessing. The one remaining case is a scan with the rotation baked into its pixels, with no matrix to read; there the OCR branch retries the quarter-turns and keeps the best-scoring result.
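Recovering the angle from a placement matrix is plain trigonometry. The sketch below assumes the usual six-number affine form [a b c d e f], where a pure rotation by θ combined with scaling puts cos θ and sin θ (times the width) in a and b. The function names are hypothetical, and the sign convention (PDF space has y pointing up, raster images usually have y pointing down) varies between libraries, so the direction of the counter-rotation should be checked against the parser in use.

```python
import math


def placement_angle(a, b):
    """Rotation in degrees encoded in the first column of an affine matrix."""
    return math.degrees(math.atan2(b, a)) % 360


def snap_to_quarter_turn(angle, tolerance=2.0):
    """Return 0, 90, 180 or 270 when the angle is within tolerance of one, else None."""
    nearest = round(angle / 90) * 90
    if abs(angle - nearest) <= tolerance:
        return nearest % 360
    return None


def counter_rotation(matrix):
    """Degrees to rotate the rendered region so the figure ends up upright."""
    a, b, _c, _d, _e, _f = matrix
    quarter = snap_to_quarter_turn(placement_angle(a, b))
    if quarter is None:
        return None  # arbitrary angle: leave it to a later fallback
    return (360 - quarter) % 360


upright = counter_rotation((400, 0, 0, 300, 50, 50))      # 400x300 image, no rotation
sideways = counter_rotation((0, 400, -300, 0, 350, 50))   # same image placed at 90 degrees
```

A scan whose rotation is baked into the pixels has an identity-like matrix, so this returns 0 for it; that is the case the OCR retry over quarter-turns is meant to cover.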
3.1. Skip: pay nothing for the noise
decorative: no call. A blank or near-uniform panel retains its empty text slot. Combined with the images the filter already discarded (the too-small, the wrong-shaped, the repeated chrome), this is where most images in a clean document end up, which is the whole point.
3.2. Classic OCR for text-images
text: a screenshot, a scanned table, a figure that is really rendered text. Classic OCR reads it locally, in milliseconds, for free. The series uses EasyOCR (docintel.parsing.pdf.easyocr); Tesseract is the other common option. OCR is precise on clean printed text and never fabricates words, which is exactly what you want when the image is text. Its companion article (Article 5 quinquies) covers OCR as a parser back-end in full; here it is one branch of the cascade.
The limitation is handwriting. A handwritten note looks like text to the classifier, but classic OCR is trained on print and interprets cursive as a string of guesses. The solution is to let OCR report its confidence. EasyOCR returns a confidence score with every line, so describe_figure reads the text and its mean confidence: a confident result is returned as-is, a low-confidence result is treated as a failed attempt and the image falls through to the vision model, which handles handwriting far better. The same path covers the rarer case where the classifier mistyped a non-text image as text. So the OCR branch is not “trust OCR blindly”; it is “try the free reader, keep its answer only when it’s sure, otherwise pay for vision.”
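The confidence gate is easy to express as a function that takes both readers as arguments. This is a minimal sketch, not describe_figure itself: the threshold, the function name, and the stub readers are assumptions, and the OCR callable is simplified to return (text, confidence) pairs.

```python
from statistics import mean

MIN_MEAN_CONFIDENCE = 0.6  # hypothetical threshold


def read_text_image(image, ocr, vision):
    """Try the free reader first; keep its answer only when it is confident.

    ocr(image) returns a list of (text, confidence) pairs.
    vision(image) returns a description string.
    """
    lines = ocr(image)
    if lines:
        confidence = mean(conf for _text, conf in lines)
        if confidence >= MIN_MEAN_CONFIDENCE:
            return " ".join(text for text, _conf in lines), "ocr"
    # Empty or low-confidence OCR counts as a failed attempt.
    return vision(image), "vision"


# Stub readers standing in for real OCR and vision back-ends.
def printed_ocr(_image):
    return [("Revenue", 0.98), ("Q3 2024", 0.94)]


def cursive_ocr(_image):
    return [("Rcvcmue", 0.31), ("0E zozh", 0.22)]


def fake_vision(_image):
    return "handwritten note: revenue, Q3 2024"


printed = read_text_image(b"", printed_ocr, fake_vision)
handwritten = read_text_image(b"", cursive_ocr, fake_vision)
```

EasyOCR's readtext returns (bounding box, text, confidence) triples by default, so a real adapter would drop the box before passing lines to a function like this one.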
3.3. Vision LLM for charts, diagrams, and photos
chart, diagram, photo: the only images where the meaning is genuinely visual. A vision model looks at the picture and writes a short description, “a line chart of commodity prices since 2022, rising then flat after Q3”, “the Transformer architecture, an encoder of N stacked layers feeding a decoder”. That sentence is text, so retrieval can finally match it. This is the one thing no text-based parser can do, and it is the costliest step, so the entire cascade exists to ensure only these images reach it. The vision call itself goes through docintel.core.analyze_image, the single place every model call in the series lives (alongside llm_parse and llm_chat); the cost it carries is the subject of Article 5 quater (vision reading).
The classifier already knows the type, so the prompt is tailored to it instead of a generic “describe this image.” A chart is asked for its axes, units, and trend; a diagram for its components and how they connect, with every label transcribed; a table rendered as an image is asked for its rows back as markdown; a photo for what it depicts. The right question elicits the right answer: ask a chart for its trend and you get the trend, ask it to “describe the image” and you get a sentence about colors. A caller can still pass an explicit prompt to override the type-specific ones, which is how a project-scoped or user-edited instruction flows through.
In: a typed image. Out: a short description, or None for a skip.
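Type-specific prompting with a caller override could look like the following. The prompt wording, the dictionary name, and the "table" key are hypothetical; the article describes what each prompt asks for but does not quote the prompts themselves.

```python
# Hypothetical prompts, written to match the questions described in the text.
PROMPT_BY_TYPE_SKETCH = {
    "chart": "Describe this chart: its axes, units, and the overall trend.",
    "diagram": (
        "Describe this diagram: its components and how they connect. "
        "Transcribe every label."
    ),
    "table": "Return the rows of this table as markdown.",
    "photo": "Describe what this photo depicts.",
}
GENERIC_PROMPT = "Describe this image."


def build_prompt(image_type, prompt=None):
    """An explicit prompt wins; otherwise use the one tailored to the type."""
    if prompt is not None:
        return prompt
    return PROMPT_BY_TYPE_SKETCH.get(image_type, GENERIC_PROMPT)


chart_prompt = build_prompt("chart")
custom_prompt = build_prompt("chart", prompt="List only the years on the x-axis.")
```

Returning the prompt alongside the description, as the prompt column does, keeps each vision result traceable to the question that produced it.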
4. Writing the description back
The description is only useful if retrieval can find it. The image already has a line slot in line_df (an image sits at a position on the page, so it occupies a line, with an empty text cell, as covered in Article 5B (the relational data model)). The cascade writes its description into that cell. describe_image_df adds a description column to image_df, and the caller joins it back onto the image’s line.
The effect is that “the architecture diagram” or “the revenue chart” now retrieves the right page, through the same keyword and embedding path as any other line. Nothing downstream needs to know the text originated from a picture.
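The write-back amounts to a keyed lookup from image to line. In this sketch plain dicts stand in for rows of line_df and image_df, and the line_id join key is an assumption; the article does not name the column that links an image to its line.

```python
def write_descriptions(lines, images):
    """Return copies of the line records with image descriptions in empty text cells."""
    description_by_line = {
        img["line_id"]: img["description"]
        for img in images
        if img.get("description")
    }
    updated = []
    for line in lines:
        description = description_by_line.get(line["line_id"])
        if description and not line["text"]:
            updated.append({**line, "text": description})
        else:
            updated.append(dict(line))
    return updated


lines = [
    {"line_id": 1, "page_num": 3, "text": "3 Model Architecture"},
    {"line_id": 2, "page_num": 3, "text": ""},  # the image's empty slot
    {"line_id": 3, "page_num": 4, "text": "3.1 Encoder and Decoder Stacks"},
]
images = [
    {
        "line_id": 2,
        "description": "Diagram of the Transformer: an encoder stack feeding a decoder.",
    }
]

updated = write_descriptions(lines, images)
# A keyword search now matches the image's line like any other line.
hits = [line["page_num"] for line in updated if "transformer" in line["text"].lower()]
```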
The enrichment process builds understanding in stages. Apply the cascade immediately after parsing a small corpus—or defer it, running it only on images a particular pipeline actually touches. The text slot starts blank until something populates it, and once filled, the contract stays unchanged: still one row per image, one line, one text value. When to populate that slot is a question this article hands off to adaptive parsing (Article 10): instead of reading every figure immediately, the lightweight text is read first, and a reference within that text (“Figure 3 below illustrates the performance gains”) triggers a vision call for the figure it references. The techniques described here are what that strategy invokes; the strategy itself is addressed in a later article.
The entire cascade is invoked in a single call. Pass it the image_df that parse_pdf produces and the original pdf_path; get back the same dataframe, now with the three new columns the cascade populated.
```python
pdf_path = "data/paper/1706.03762v7.pdf"
parsed = parse_pdf(pdf_path)  # image_df pinpoints each image
enriched = describe_image_df(parsed["image_df"], pdf_path=pdf_path)

# Columns to inspect once the cascade has run:
#   worth_analyzing : the lightweight filter's decision (True/False)
#   image_type      : "decorative" | "text" | "chart" | "diagram" | "photo" | None
#   description     : searchable text written into the image's row slot
#   prompt          : the instruction sent to the vision model (None for OCR / skip)
enriched[["page_num", "worth_analyzing", "image_type", "description", "prompt"]].head()
```
This is also the cascade layer users can review and adjust directly. The screenshot below shows a desktop document viewer applying the same pipeline to NIST AI 100-1 (the AI Risk Management Framework, a U.S. Government work in the public domain): the Images tab lists all figures the parser detected, the highlighted diagram carries the gpt-4.1-generated description, and that description remains editable. Individual image controls let the user re-run OCR or override the result by forcing a vision call when the fast path produced an incorrect outcome.

5. Cost and latency: you pay per image, not per page
The cascade’s core principle is simple: spending should reflect actual value. The lightweight filter runs on every image and the classifier on every retained one, yet both cost virtually nothing. OCR runs locally at no charge. The vision model, the sole line item with a real monetary and time cost, fires only on charts, diagrams, and photos (plus the text images OCR could not read confidently), which in most enterprise documents represent a small share of total images and an even smaller share of total pages.
A blanket approach—captioning every image with a vision model—charges the same amount for a logo as for a complex chart, even though logos make up the majority of images. The cascade replaces that flat per-image vision expense with a filtering step, a low-cost classifier, and a vision call reserved for cases where nothing else can interpret the image. On a report with one logo per page and two meaningful figures, that means two vision calls instead of dozens.
No single image is ever billed twice. The filter already discards recurring chrome elements that appear on most pages, but a genuine figure may still recur on several pages (a reference diagram, a recurring exhibit). The cascade uses the content hash as its key, so a figure appearing on ten pages is analyzed once and its description is reused for the remaining nine. One image, one model invocation, regardless of how many times it appears.
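Keying on the content hash is a small memoization step. The sketch below is illustrative only: describe_once and the counting stub are hypothetical names, and plain dicts stand in for image_df rows.

```python
def describe_once(images, describe):
    """Call describe at most once per content hash and reuse the result."""
    cache = {}
    described = []
    for img in images:
        key = img["hash"]
        if key not in cache:
            cache[key] = describe(img)  # the only place a model call happens
        described.append({**img, "description": cache[key]})
    return described


calls = []


def counting_describe(img):
    """Stub for an expensive model call that records each invocation."""
    calls.append(img["hash"])
    return f"description of {img['hash']}"


# The same exhibit appears on ten pages.
pages = [{"page_num": p, "hash": "exhibit-a"} for p in range(1, 11)]
described = describe_once(pages, counting_describe)
```

Ten rows come back with the same description, and the stub records a single call.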
6. Conclusion
image_df locates every image; it reads none of them. Image reading is a separate capability, and this article lays out its methods in order of increasing cost: discard the noise at no expense, classify what remains using cheap heuristics, extract readable text with local OCR, and reserve the vision model for the charts, diagrams, and photos where the meaning is inherently visual. Each method places its output in the image’s text slot, turning every image into just another searchable line in the document. What this article intentionally leaves open is which images to analyze during a given run: exhaustively reading every figure upfront is rarely worthwhile, and the context-aware approach, which lets the surrounding text and the analysis goal decide, is the subject of adaptive parsing (Article 10). First, the tools; next, the decision logic.
Sources and further reading
- Article 5 (parsing) and Article 5B (the relational tables) introduce image_df and the row slot the description is written into.
- Article 5 quater (vision reading) covers the vision-LLM back-end and its costs.
- Article 5 quinquies (EasyOCR) covers classic OCR as a parser back-end.
- Article 10 (adaptive parsing) is where the deferred decision gets made: which images to analyze in a given run, escalating from cheap text to a vision call only where the context justifies it.
Earlier in the series:
- Document Intelligence: series intro. What the series constructs, module by module, and in what sequence.
- Baseline Enterprise RAG, from PDF to highlighted answer. The four-module pipeline from end to end: PDF in, highlighted answer out.
- Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity succeeds (synonyms, typos, paraphrase), where it predictably falters (unknown terms, negation, query-vs-answer relevance), and how to deploy it regardless.
- Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost. What a cross-encoder contributes beyond bi-encoder embeddings, with measurements, and when it justifies the added latency.
- RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size tuning and fine-tuning target the wrong objective; route by question type instead.
- From regex to vision models: which RAG technique fits which problem. Two dimensions—document complexity and query control—that determine the right technique for each scenario.
- 10 common RAG mistakes we keep encountering in production. Ten production pitfalls, organized module by module, with the resolution for each.
- Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing module: the document’s character, its signals, and its summary.
- Stop returning flat text from a PDF: the relational structure RAG needs. The second half of the parsing module: the relational tables that every downstream module reads.



