Turning PDF Images Into Searchable RAG Assets—Without Reading Every Page

This companion piece is part of the Enterprise Document Intelligence series, which constructs an enterprise RAG system from four foundational components. It builds on Article 5 (document parsing) by focusing on a single table: image_df, which pinpoints every image embedded in the PDF without actually interpreting any of them. In this installment, we assemble the reading toolkit — a cascade of methods arranged by cost (a low-cost filter, a type check, traditional OCR, a vision model) — that converts the small number of images worth investing in into searchable text.

This companion’s position in the series: it extends Article 5 (document parsing), within Part II (the four bricks), handling the images the parser merely located — Image by author

The parsing component hands you image_df: one entry per image found in the PDF, recording its page number, bounding box, dimensions, and a content hash. That tells you where every image sits. It says nothing about what any of them depicts. For retrieval purposes, that is equivalent to not having them at all — a bounding box isn’t something a user can query, and the image’s text field, where a description would normally go, remains blank.

The instinctive reaction is to feed every image into a vision model and call it done. That is the wrong default. A typical document is saturated with images that hold nothing a user would ever search for: the corporate logo repeated in every page header, a horizontal rule rendered as a 2-pixel-tall graphic, a bullet-point glyph, a decorative banner. Generating captions for those with a vision LLM means paying a model to describe a logo three hundred times over.

So the problem divides into two parts. First, the techniques for converting an image into text, along with the cost each one incurs: a low-cost filter, a type check, traditional OCR, a vision model. Second, which images are genuinely worth the investment in a given processing run. That second question is driven by context. A sentence in the body text reading “Figure 3 below illustrates…” is the signal to interpret that specific figure with a vision model while skipping its neighbors; the user’s actual question narrows the selection even further. This article establishes the techniques and demonstrates what each one produces, ordered by cost. Deciding which images to invest in, per document and per query, is the subject of adaptive parsing, covered separately in Article 10. Here, we build the toolbox.

*One extracted image goes in, a searchable description comes out — using the cheapest method capable of reading it — Image by author*

1. Most images don’t justify a model call

The opening step costs nothing. Before any OCR or vision-model invocation, an inexpensive filter examines signals already present in image_df along with a handful of pixel-level statistics, and discards images that carry no retrieval value:

Too small. An image whose shortest dimension is only a few dozen pixels, or whose total pixel area falls below a modest minimum, is an icon or a bullet — not a meaningful figure. A size threshold eliminates the bulk of these.
The wrong shape. An image that is extremely long and extremely narrow is a rule or a divider, not substantive content. An aspect-ratio check catches these.
Repeated everywhere. The same content hash appearing across most pages of the document signals chrome — a header logo, a footer emblem, a watermark. Tracking how many pages an image hash occurs on flags it as decoration rather than information.

is_worth_analyzing enforces these size and shape rules on each image individually, while flag_worth_analyzing first counts how many pages each content hash appears on, then appends a worth_analyzing column to image_df. Both functions reside in docintel.parsing.pdf.images. The thresholds are intentionally permissive: a false keep results in one unnecessary model call later, whereas a false drop loses content irretrievably, so when uncertain the filter retains the image. Flat, contentless images that are too large to be caught by the size test (a solid-color panel, for instance) are not forced through here; they are intercepted one step later, classified as decorative, and skipped in exactly the same way.

Input: image_df (plus per-image pixel statistics). Output: the same table augmented with a worth_analyzing flag.
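A minimal sketch of such a filter, not the docintel implementation. Only the two function names come from the text; the column names (width, height, page_num, image_hash), the n_pages parameter, and every threshold value are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical thresholds, deliberately permissive (when in doubt, keep).
MIN_SIDE_PX = 32          # shortest side below this: icon or bullet
MIN_AREA_PX = 4_000       # total area below this: too small to be a figure
MAX_ASPECT_RATIO = 20.0   # longer/shorter side above this: rule or divider
MAX_PAGE_SHARE = 0.5      # hash on more than this share of pages: chrome


def is_worth_analyzing(width: int, height: int) -> bool:
    """Size and shape rules for one image."""
    short, long_ = min(width, height), max(width, height)
    if short < MIN_SIDE_PX or width * height < MIN_AREA_PX:
        return False
    return long_ / short <= MAX_ASPECT_RATIO


def flag_worth_analyzing(image_df: pd.DataFrame, n_pages: int) -> pd.DataFrame:
    """Return a copy of image_df with a boolean worth_analyzing column."""
    out = image_df.copy()
    # Number of distinct pages each content hash appears on.
    pages_per_hash = out.groupby("image_hash")["page_num"].transform("nunique")
    repeated = pages_per_hash / n_pages > MAX_PAGE_SHARE
    size_ok = [is_worth_analyzing(w, h) for w, h in zip(out["width"], out["height"])]
    out["worth_analyzing"] = pd.Series(size_ok, index=out.index) & ~repeated
    return out


# Toy 4-page document: a header logo on every page, a rule, a bullet, one figure.
image_df = pd.DataFrame({
    "page_num":   [1, 2, 3, 4, 1, 2, 3],
    "width":      [200, 200, 200, 200, 600, 12, 640],
    "height":     [60, 60, 60, 60, 2, 12, 480],
    "image_hash": ["logo", "logo", "logo", "logo", "rule", "dot", "fig"],
})
flagged = flag_worth_analyzing(image_df, n_pages=4)
print(flagged[["image_hash", "worth_analyzing"]])
```

The logo passes the size and shape rules on its own and is dropped only because its hash repeats, which is why the repetition count has to be computed on the whole table and not image by image.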

On a typical report, this step alone eliminates the vast majority of images before a single model is invoked. What remains is the small set that actually carries meaning.

2. What kind of image is it?

The images that pass through the filter are not all processed the same way. A screenshot of a table is text: traditional OCR reads it cheaply and accurately. A line chart is not text at all; its meaning resides in the axes and the trend, and only a vision model can articulate that in words. Sending the chart to OCR yields a handful of stray axis labels; sending the screenshot to a vision model incurs chart-level costs for something OCR handles for free.

So the second step assigns each retained image to a category:

decorative: a blank or near-uniform panel. Skip it.
text: a screenshot, a scanned region, a table rendered as an image. Process with OCR.
chart / diagram / photo: the meaning is visual. Process with a vision model.

classify_image returns one ImageType derived from inexpensive pixel signals: how much the pixels vary, how saturated they are, how much of the image consists of near-white background, and how dense its edges are. A near-uniform panel is classified as decorative. The test for this deserves a closer look, because the obvious approach doesn’t work: you cannot detect a blank panel by counting its colors. A real “all-black” or “all-white” region is never pixel-perfect — sensor noise and JPEG compression produce hundreds of near-identical color values, so a color count sails right past it. What stays near zero on a blank panel, noise included, is the dispersion of the pixel values — their standard deviation. Low dispersion means blank, regardless of the color count, so that is the signal used. Black ink on a white page — near-zero saturation with genuine stroke structure — is classified as text. A saturated, full-bleed image with no white margins is a photo. Everything else, every ambiguous case, falls through to chart.

Notice what is absent from that list: a step that decides “this looks like a logo.” That omission is deliberate, and it follows the same logic as the blank-panel case. A logo can be two flat colors, a black wordmark on white, or a full-color gradient with soft edges. Counting colors catches the flat variants and misses the gradient, and worse, the same two-color test also catches a bilevel scan of real text you actually wanted to read. Appearance alone does not identify a logo. Behavior does: a logo is chrome because it repeats, the same mark appearing in every page header. That signal was already evaluated, back in the filter, which discards any image whose content hash recurs across most pages, regardless of how many colors it contains.

A logo that appears just once, a mark on a cover page, doesn’t warrant special treatment; it gets processed like everything else, a wordmark handed off to free OCR, a graphic routed to a single vision call. The guiding principle is consistent: only skip content you’re certain is empty or decorative chrome, and read everything else, because an incorrect skip silently discards meaningful content.
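The color-counting argument can be checked on synthetic pixels. The sketch below is an illustration under stated assumptions: the noise model (each channel jittering by a few levels), the helper names, and the BLANK_STD_MAX threshold are all hypothetical, and the real classifier may compute dispersion differently.

```python
import random
import statistics

BLANK_STD_MAX = 8.0  # hypothetical threshold on 0-255 pixel values


def noisy_panel(n_pixels: int, base: int, noise: int, seed: int = 0) -> list:
    """A 'flat' panel whose channels jitter by +/- noise, as after lossy compression."""
    rng = random.Random(seed)
    return [
        tuple(base + rng.randint(-noise, noise) for _ in range(3))
        for _ in range(n_pixels)
    ]


def color_count(pixels) -> int:
    """Number of distinct RGB values."""
    return len(set(pixels))


def dispersion(pixels) -> float:
    """Population standard deviation over all channel values."""
    return statistics.pstdev(v for px in pixels for v in px)


def is_blank(pixels) -> bool:
    return dispersion(pixels) < BLANK_STD_MAX


panel = noisy_panel(10_000, base=128, noise=3)    # visually uniform grey
bilevel = [(0, 0, 0), (255, 255, 255)] * 5_000    # black ink on white

for name, px in [("panel", panel), ("bilevel", bilevel)]:
    print(name, color_count(px), round(dispersion(px), 1), is_blank(px))
```

The noisy panel has hundreds of distinct colors yet a dispersion of only a few levels, while the bilevel scan has two colors and a very large dispersion. A color count would rank them the wrong way round; the standard deviation does not.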

That fallback to chart is the other critical design decision. Trying to distinguish a chart from a diagram from a photo using only low-level signals isn’t dependable, so the classifier doesn’t attempt to be overly sophisticated: it only routes an image to cheap OCR when it’s confident the image is clean monochrome text, and sends everything else to the vision model, which interprets charts, diagrams, photos, and any text they happen to contain. The asymmetry is deliberate. A missed OCR shortcut costs one vision call; running OCR on a diagram produces a handful of stray axis labels and gibberish. So when uncertain, the classifier pays for vision. Classification itself remains cheap, no model call needed, because it must cost less than the analysis it’s meant to avoid.

In: an image that passed the filter. Out: its ImageType.
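The routing rules above can be sketched as a pure function over precomputed signals. This is not the signature of classify_image, which the article does not show; the PixelSignals container, the function name classify_from_signals, the choice of an Enum, and all threshold values are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class ImageType(str, Enum):
    DECORATIVE = "decorative"
    TEXT = "text"
    CHART = "chart"
    DIAGRAM = "diagram"
    PHOTO = "photo"


@dataclass(frozen=True)
class PixelSignals:
    std: float             # dispersion of pixel values, 0-255 scale
    saturation: float      # mean saturation, 0-1
    white_fraction: float  # share of near-white pixels, 0-1
    edge_density: float    # share of pixels lying on an edge, 0-1


def classify_from_signals(s: PixelSignals) -> ImageType:
    """Route on cheap signals; every threshold here is hypothetical."""
    if s.std < 8.0:
        return ImageType.DECORATIVE            # near-uniform panel
    if s.saturation < 0.05 and s.white_fraction > 0.5 and s.edge_density > 0.02:
        return ImageType.TEXT                  # black ink on white, with strokes
    if s.saturation > 0.3 and s.white_fraction < 0.05:
        return ImageType.PHOTO                 # saturated, full-bleed
    return ImageType.CHART                     # every ambiguous case


print(classify_from_signals(PixelSignals(2.0, 0.0, 0.9, 0.0)))
print(classify_from_signals(PixelSignals(90.0, 0.01, 0.8, 0.10)))
print(classify_from_signals(PixelSignals(60.0, 0.5, 0.01, 0.15)))
print(classify_from_signals(PixelSignals(50.0, 0.2, 0.4, 0.05)))
```

The order of the tests carries the asymmetry: the two confident exits (decorative, text) come first, and the final return is the paid path.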

3. The cascade: the cheapest method that can read it

Type determines method. METHOD_BY_TYPE maps each type to one of three actions, ordered by cost, and describe_figure dispatches accordingly. The entire decision, for the cases you actually encounter in a document, fits in one table: what captures the image, what it costs, and what you get back.

*The cascade decision for every image type you encounter in a real document, from free to paid – Image by author*

Read it top to bottom and you’re reading the cascade in order. The first three rows never reach a model at all: the filter discards them based on size, shape, or repetition. The next row is caught by the classifier as a blank panel and skipped as well. Only the bottom five incur any cost, and of those only the genuine text-image takes the free path. The rest reach the vision model, which is precisely where your budget should be directed.
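In code, the same decision reduces to a lookup followed by a dispatch. The sketch below assumes the contents of METHOD_BY_TYPE (the article names the mapping but does not list it); the action labels, the function name dispatch_by_type, and the injected run_ocr / run_vision callables are hypothetical stand-ins for the real readers.

```python
from typing import Callable, Optional

# Assumed contents of the type-to-method mapping; "skip", "ocr" and "vision"
# are illustrative labels, not docintel constants.
METHOD_BY_TYPE = {
    "decorative": "skip",
    "text": "ocr",
    "chart": "vision",
    "diagram": "vision",
    "photo": "vision",
}


def dispatch_by_type(
    image_type: Optional[str],
    run_ocr: Callable[[], Optional[str]],
    run_vision: Callable[[], str],
) -> Optional[str]:
    """Return a description, or None when the image is skipped."""
    if image_type is None:            # dropped by the filter, never classified
        return None
    method = METHOD_BY_TYPE.get(image_type, "vision")  # unknown type: pay for vision
    if method == "skip":
        return None
    if method == "ocr":
        text = run_ocr()              # None signals a read that was not trusted
        return text if text is not None else run_vision()
    return run_vision()


print(dispatch_by_type("decorative", lambda: "ocr text", lambda: "vision text"))
print(dispatch_by_type("text", lambda: "ocr text", lambda: "vision text"))
print(dispatch_by_type("chart", lambda: "ocr text", lambda: "vision text"))
```

Passing the readers in as callables keeps the routing testable without a model or an OCR engine installed.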

Watch out for sideways figures. A wide table or a landscape chart is often rotated 90 degrees to fit a portrait page. The rotation rarely appears where you’d first expect: the page’s rotation flag stays at 0, and the angle is embedded in the image’s own placement matrix instead. Rendered as-is, the figure reaches OCR or the vision model sideways, where OCR returns noise and a vision model interprets it with misplaced confidence and no indication it struggled. So the cascade reads the placement angle and counter-rotates the region before either method sees it: automatic, precise, no orientation-guessing. The one remaining case is a scan with the rotation baked into its pixels, with no matrix to read; there the OCR branch retries the quarter-turns and keeps the best-scoring result.
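Both paths, reading the angle and falling back to a search, can be sketched on a toy grid. Everything here is illustrative: the function names are hypothetical, the grid stands in for a raster image, and the direction in which a given angle must be undone depends on the axis convention of the rendering library (PDF space has y pointing up, raster space usually has it pointing down), so the sign should be verified against a known rotated figure.

```python
import math
from typing import Callable, List

Grid = List[List[int]]  # toy stand-in for a raster image, row-major


def placement_quarter_turns(a: float, b: float) -> int:
    """Quarter-turns (0-3) encoded by the a, b entries of a matrix [a b c d e f]."""
    angle = math.degrees(math.atan2(b, a))
    return round(angle / 90) % 4


def rotate_quarter_turns(grid: Grid, k: int) -> Grid:
    """Rotate counter-clockwise by k quarter-turns (negative k rotates clockwise)."""
    for _ in range(k % 4):
        grid = [list(col) for col in zip(*grid)][::-1]
    return grid


def best_quarter_turn(grid: Grid, score: Callable[[Grid], float]) -> Grid:
    """Fallback for baked-in rotation: try all four orientations, keep the best score."""
    return max((rotate_quarter_turns(grid, k) for k in range(4)), key=score)


upright = [[1, 2], [3, 4]]
sideways = rotate_quarter_turns(upright, 1)

# Known angle: undo it directly.
k = placement_quarter_turns(a=0.0, b=1.0)
restored = rotate_quarter_turns(sideways, -k)

# Unknown angle: search. A real score would be the mean OCR confidence.
found = best_quarter_turn(sideways, score=lambda g: float(g[0][0] == 1))
print(restored == upright, found == upright)
```

The first path costs one rotation; the second costs up to four OCR passes, which is why it is reserved for scans with no matrix to read.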

3.1. Skip: pay nothing for the noise

decorative: no call. A blank or near-uniform panel retains its empty text slot. Combined with the images the filter already discarded (the too-small, the wrong-shaped, the repeated chrome), this is where most images in a clean document end up, which is the whole point.

3.2. Classic OCR for text-images

text: a screenshot, a scanned table, a figure that is really rendered text. Classic OCR reads it locally, in milliseconds, for free. The series uses EasyOCR (docintel.parsing.pdf.easyocr); Tesseract is the other common option. OCR is precise on clean printed text and transcribes what is on the page instead of generating a description, which is exactly what you want when the image is text. A companion article (Article 5 quinquies) covers OCR as a parser back-end in full; here it is one branch of the cascade.

The limitation is handwriting. A handwritten note looks like text to the classifier, but classic OCR is trained on print and interprets cursive as a string of guesses. The solution is to let OCR report its confidence. EasyOCR returns a confidence score with every line, so describe_figure reads the text and its mean confidence: a confident result is returned as-is, a low-confidence result is treated as a failed attempt and the image falls through to the vision model, which handles handwriting far better. The same path covers the rarer case where the classifier mistyped a non-text image as text. So the OCR branch is not “trust OCR blindly”; it is “try the free reader, keep its answer only when it’s sure, otherwise pay for vision.”
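The confidence gate is small enough to sketch in full. The input shape matches what EasyOCR's Reader.readtext returns by default, a list of (bounding box, text, confidence) tuples; the function name accept_ocr and the 0.6 threshold are assumptions, not values from the series.

```python
from typing import Optional, Sequence, Tuple

MIN_MEAN_CONFIDENCE = 0.6  # hypothetical cut-off, to be tuned on real documents

OcrLine = Tuple[object, str, float]  # (bounding box, text, confidence)


def accept_ocr(
    lines: Sequence[OcrLine],
    min_confidence: float = MIN_MEAN_CONFIDENCE,
) -> Optional[str]:
    """Return the OCR text when its mean confidence is high enough, else None.

    None means "treat this as a failed attempt and fall through to vision".
    """
    if not lines:
        return None
    mean_conf = sum(conf for _, _, conf in lines) / len(lines)
    if mean_conf < min_confidence:
        return None
    return "\n".join(text for _, text, _ in lines)


printed = [(None, "Revenue by quarter", 0.97), (None, "Q1  4.2  Q2  4.8", 0.91)]
cursive = [(None, "Rcvcnue bg qvartcr", 0.31), (None, "0l 4z", 0.22)]

print(accept_ocr(printed))
print(accept_ocr(cursive))
```

An empty result is treated like a low-confidence one: if OCR found no text in an image the classifier called text, the vision model gets a chance to say what is there.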

3.3. Vision LLM for charts, diagrams, and photos

chart, diagram, photo: the only images where the meaning is genuinely visual. A vision model looks at the picture and writes a short description, “a line chart of commodity prices since 2022, rising then flat after Q3”, “the Transformer architecture, an encoder of N stacked layers feeding a decoder”. That sentence is text, so retrieval can finally match it. This is the one thing no text-based parser can do, and it is the costliest step, so the entire cascade exists to ensure only these images reach it. The vision call itself goes through docintel.core.analyze_image, the single place every model call in the series lives (alongside llm_parse and llm_chat); the cost it carries is the subject of Article 5 quater (vision reading).

The classifier already knows the type, so the prompt is tailored to it instead of a generic “describe this image.” A chart is asked for its axes, units, and trend; a diagram for its components and how they connect, with every label transcribed; a table rendered as an image is asked for its rows back as markdown; a photo for what it depicts. The right question elicits the right answer: ask a chart for its trend and you get the trend, ask it to “describe the image” and you get a sentence about colors. A caller can still pass an explicit prompt to override the type-specific ones, which is how a project-scoped or user-edited instruction flows through.
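A sketch of how type-specific prompts and the override might fit together. The prompt wording, the PROMPT_BY_TYPE name, the build_prompt helper, and the choice to file the table-as-image prompt under the text type are all assumptions; the article specifies only what each type is asked for.

```python
from typing import Optional

# Hypothetical wording; only the intent of each prompt comes from the article.
PROMPT_BY_TYPE = {
    "chart": "Describe this chart: its axes, units, and the overall trend.",
    "diagram": "Describe this diagram: its components, how they connect, "
               "and transcribe every label.",
    "text": "Transcribe this image. If it is a table, return its rows as markdown.",
    "photo": "Describe what this photo depicts.",
}


def build_prompt(image_type: str, prompt: Optional[str] = None) -> str:
    """An explicit prompt wins; otherwise use the one tailored to the type."""
    if prompt is not None:
        return prompt
    # Unrecognized types get the chart prompt, mirroring the classifier's fallback.
    return PROMPT_BY_TYPE.get(image_type, PROMPT_BY_TYPE["chart"])


print(build_prompt("diagram"))
print(build_prompt("chart", prompt="List every country named on the x-axis."))
```

Returning the prompt as a plain string also makes it easy to store next to the description, so a reviewer can see which question produced which answer.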

In: a typed image. Out: a short description, or None for a skip.

4. Writing the description back

The description is only useful if retrieval can find it. The image already has a line slot in line_df (an image sits at a position on the page, so it occupies a line, with an empty text cell, as covered in Article 5B (the relational data model)). The cascade writes its description into that cell. describe_image_df adds a description column to image_df, and the caller joins it back onto the image’s line.

The effect is that “the architecture diagram” or “the revenue chart” now retrieves the right page, through the same keyword and embedding path as any other line. Nothing downstream needs to know the text originated from a picture.
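The write-back itself is an ordinary keyed update. The sketch below assumes a line_id column that links each image row to its line in line_df; that key, the toy tables, and the write_descriptions helper are hypothetical, since the article does not show the join.

```python
import pandas as pd


def write_descriptions(line_df: pd.DataFrame, image_df: pd.DataFrame) -> pd.DataFrame:
    """Copy each non-empty image description into the text cell of its line."""
    desc = image_df.dropna(subset=["description"]).set_index("line_id")["description"]
    out = line_df.copy()
    mask = out["line_id"].isin(desc.index)
    out.loc[mask, "text"] = out.loc[mask, "line_id"].map(desc)
    return out


line_df = pd.DataFrame({
    "line_id": [0, 1, 2],
    "page_num": [1, 1, 2],
    "text": ["Intro paragraph.", "", ""],   # lines 1 and 2 are image slots
})
image_df = pd.DataFrame({
    "line_id": [1, 2],
    "description": ["A line chart of commodity prices.", None],  # None: skipped
})

updated = write_descriptions(line_df, image_df)
print(updated)
```

A skipped image keeps its empty cell, and ordinary text lines are never touched, so the table has the same shape before and after.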

The enrichment process builds understanding in stages. Apply the cascade immediately after parsing a small corpus—or defer it, running it only on images a particular pipeline actually touches. The text slot starts blank until something populates it, and once filled, the contract stays unchanged: still one row per image, one line, one text value. When to populate that slot is a question this article hands off to adaptive parsing (Article 10): instead of reading every figure immediately, the lightweight text is read first, and a reference within that text (“Figure 3 below illustrates the performance gains”) triggers a vision call for the figure it references. The techniques described here are what that strategy invokes; the strategy itself is addressed in a later article.

The entire cascade is invoked in a single call. Pass it the image_df that parse_pdf produces and the original pdf_path; get back the same dataframe, now with the four new columns the cascade populated.

parsed = parse_pdf("data/paper/1706.03762v7.pdf")   # image_df pinpoints each image
enriched = describe_image_df(parsed["image_df"], pdf_path="data/paper/1706.03762v7.pdf")

# describe_image_df appends four columns to image_df:
enriched[["page_num", "worth_analyzing", "image_type", "description", "prompt"]].head()
# worth_analyzing : the lightweight filter's decision     (True/False)
# image_type      : "decorative" | "text" | "chart" | "diagram" | "photo" | None
# description     : searchable text written into the image's line slot
# prompt          : the instruction sent to the vision model (None for OCR / skip)

This is also the cascade layer users can review and adjust directly. The screenshot below shows a desktop document viewer applying the same pipeline to NIST AI 100-1 (the AI Risk Management Framework, a U.S. Government work in the public domain): the Images tab lists all figures the parser detected, the highlighted diagram carries the gpt-4.1-generated description, and that description remains editable. Individual image controls let the user re-run OCR or override the result by forcing a vision call when the fast path produced an incorrect outcome.

*the cascade made visible to the user: every detected figure, its description inserted into the document model, and per-image controls to re-run OCR or force a vision pass – Image by author*

5. Cost and latency: you pay per image, not per page

The cascade’s core principle is simple: spending should reflect actual value. The lightweight filter runs on every image and the classifier on every image the filter retains, yet both cost virtually nothing. OCR runs locally at no charge. The vision model, the sole line item with a real monetary and time cost, fires only on charts, diagrams, and photos (plus the occasional text-image whose OCR result was not confident enough), which in most enterprise documents represent a small share of total images and an even smaller share of total pages.

A blanket approach—captioning every image with a vision model—charges the same amount for a logo as for a complex chart, even though logos make up the majority of images. The cascade replaces that flat per-image vision expense with a filtering step, a low-cost classifier, and a vision call reserved for cases where nothing else can interpret the image. On a report with one logo per page and two meaningful figures, that means two vision calls instead of dozens.
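The arithmetic behind that comparison, as a small sketch. The 30-page length is a hypothetical figure picked for illustration, and the calculation assumes both meaningful figures are charts, diagrams, or photos, so each needs one vision call.

```python
def vision_calls(pages: int, logos_per_page: int, figures: int) -> dict:
    """Vision calls under blanket captioning versus the cascade."""
    total_images = pages * logos_per_page + figures
    return {
        "blanket": total_images,  # every image is captioned
        "cascade": figures,       # logos are filtered out as repeated chrome
    }


calls = vision_calls(pages=30, logos_per_page=1, figures=2)
print(calls)  # {'blanket': 32, 'cascade': 2}
```

The blanket count grows with the page count while the cascade count depends only on the number of meaningful figures.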

No single image is ever billed twice. The filter already discards recurring chrome that appears on most pages, but a genuine figure may still appear on several pages (a reference diagram, a recurring exhibit). The cascade keys its results on the content hash, so a figure appearing on ten pages is analyzed once and its description is reused for the other nine occurrences. One image, one model invocation, regardless of how many times it appears.
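A minimal sketch of that reuse, assuming each image arrives as a (content hash, payload) pair. The describe_unique helper and the injected describe callable are hypothetical; in the series the real call would go through the cascade.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def describe_unique(
    images: Iterable[Tuple[str, bytes]],
    describe: Callable[[bytes], Optional[str]],
) -> List[Optional[str]]:
    """Describe each distinct content hash once and reuse the result."""
    cache: Dict[str, Optional[str]] = {}
    descriptions: List[Optional[str]] = []
    for image_hash, payload in images:
        if image_hash not in cache:          # first occurrence: pay for the call
            cache[image_hash] = describe(payload)
        descriptions.append(cache[image_hash])
    return descriptions


model_calls = []


def fake_describe(payload: bytes) -> str:
    """Stand-in for a vision call; records how often it is invoked."""
    model_calls.append(payload)
    return f"description of {len(payload)} bytes"


# The same exhibit on ten pages, plus one other figure.
occurrences = [("exhibit", b"abc")] * 10 + [("other", b"defgh")]
results = describe_unique(occurrences, fake_describe)
print(len(results), len(model_calls))
```

A skipped image (a None description) is cached like any other result, so a decorative panel repeated on a few pages is not re-examined either.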

6. Conclusion

image_df locates every image; it reads none of them. Image reading is a separate capability, and this article lays out its methods in order of increasing cost: discard the noise at no expense, classify what remains using cheap heuristics, extract readable text with local OCR, and reserve the vision model for the charts, diagrams, and photos whose meaning is inherently visual. Each method places its output in the image’s text slot, turning every image into just another searchable line in the document. What this article intentionally leaves open is which images to analyze during a given run: exhaustively reading every figure upfront is rarely worthwhile, and the context-aware approach, letting the surrounding text and the analysis goal decide, is the subject of adaptive parsing (Article 10). First, the tools; next, the decision logic.

Sources and further reading

Article 5 (parsing) and Article 5B (the relational tables) introduce image_df and the line slot the description is written into.
Article 5 quater (vision reading) covers the vision-LLM back-end and its costs.
Article 5 quinquies (EasyOCR) covers classic OCR as a parser back-end.
Article 10 (adaptive parsing) is where the deferred decision gets made: which images to analyze in a given run, escalating from cheap text to a vision call only where the context justifies it.

Earlier in the series:

Document Intelligence: series intro. What the series constructs, module by module, and in what sequence.
Baseline Enterprise RAG, from PDF to highlighted answer. The four-module pipeline from end to end: PDF in, highlighted answer out.
Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity succeeds (synonyms, typos, paraphrase), where it predictably falters (unknown terms, negation, query-vs-answer relevance), and how to deploy it regardless.
Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost. What a cross-encoder contributes beyond bi-encoder embeddings, with measurements, and when it justifies the added latency.
RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size tuning and fine-tuning target the wrong objective; route by question type instead.
From regex to vision models: which RAG technique fits which problem. Two dimensions—document complexity and query control—that determine the right technique for each scenario.
10 common RAG mistakes we keep encountering in production. Ten production pitfalls, organized module by module, with the resolution for each.
Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing module: the document’s character, its signals, and its summary.
Stop returning flat text from a PDF: the relational structure RAG needs. The second half of the parsing module: the relational tables that every downstream module reads.
