As the saying goes, a single picture communicates more than a thousand words. Still, only a handful of corporate chatbots are capable of reliably returning images that are directly tied to their source material.
What’s the reason?
The answer is: although this would be a major improvement over a purely text-based experience, it’s challenging to execute with consistent reliability. Still, there’s no shortage of situations where this capability would be incredibly valuable. From real estate prospects wanting to view properties to technicians seeking the latest machine specifications, users would much rather receive precise, relevant images and maintenance tables directly within the response. Currently, the most we can manage is to provide answers with links pointing to source files (brochures, videos, manuals, and websites).
In this article, I’ll introduce an open-source MultiModal Proxy-Pointer RAG pipeline that makes this possible. It does so by treating documents as organized hierarchies of meaningful segments, rather than random word collections that get blindly broken into pieces for answering queries.
This article builds on my earlier writings about Proxy-Pointer RAG, where I thoroughly examined the design principles and implementation details. Here’s what we’ll cover:
- Why is delivering multimodal replies challenging? What existing methods can be employed?
- How Proxy-Pointer achieves this with robust scalability and minimal expense through a text-exclusive pipeline — requiring no multimodal embeddings
- A live prototype with sample queries, plus an open-source repository for you to experiment with.
Let’s dive in.
Multimodality and Standard RAG
Multimodal RAG, as usually discussed, means you can search your knowledge base with images as well as text queries. It rarely works in the other direction: returning relevant images as part of the answer. To understand why, let’s examine the strategies typically employed:
Image Captioning
Process images through an OCR or Vision model, convert the image into a descriptive paragraph of text, and store it as a segment alongside other text. This method falls short, since the sliding window approach to chunking may cut image captions across different pieces.
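For context, here is a minimal sketch of what that captioning step usually looks like. The model name and prompt are illustrative assumptions, not the pipeline described later in this article.

# Sketch: caption a figure with a vision-capable LLM and store the caption as plain text.
# Assumes the google-genai SDK; the model name is an illustrative choice.
from pathlib import Path
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

def caption_figure(image_path: str) -> str:
    img_bytes = Path(image_path).read_bytes()
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # assumption: any vision-capable model works here
        contents=[
            types.Part.from_bytes(data=img_bytes, mime_type="image/png"),
            "Describe this figure in one short paragraph for retrieval purposes.",
        ],
    )
    return response.text

# The caption is then appended to the surrounding text and chunked like everything else,
# which is exactly where sliding-window chunking can cut it across pieces.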
The fundamental problem lies in a mismatch between what gets retrieved and what carries meaning. Traditional RAG systems pull out random segments, whereas significance—particularly for images—resides in coherent document sections.
When a fragment is fetched, the LLM may only encounter a partial caption (e.g., for Figure 5), making it tough to assess whether that image is truly relevant to this particular piece or to a neighboring one that wasn’t included. Furthermore, the answer synthesizer often receives multiple fragments from various documents with no shared context, possibly containing several unrelated image captions. This makes it hard for the LLM to confidently decide which images, if any, truly matter for the user’s question.
Multimodal Embeddings
A different strategy is to encode both images and text into a common vector space using a multimodal model. While this permits cross-modal searching, it introduces its own set of issues. Multimodal embeddings prioritize similarity over precise placement. Visuals or layouts that look alike—such as financial tables from different companies—can seem nearly indistinguishable in vector space, even when just one truly matches the query.
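For reference, here is roughly what the shared-vector-space approach looks like in practice: a minimal sketch using an off-the-shelf CLIP checkpoint via sentence-transformers. The file name and query are illustrative; this is not part of the pipeline described below.

# Sketch: embed images and text into one vector space with a CLIP model.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # encodes both images and text

image_vecs = model.encode([Image.open("figures/revenue_table.png")])  # hypothetical file
text_vecs = model.encode(["What was the 2023 revenue of Company X?"])

# Cosine similarity tells you which images "look related" to the query,
# but two visually similar financial tables can score almost identically.
print(util.cos_sim(text_vecs, image_vecs))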
Without understanding the structure of the document, the system fetches options based on how similar they appear but can’t confidently pinpoint which image genuinely belongs in the answer. Consequently, the LLM finds itself having to pick among several seemingly correct but potentially inaccurate visuals—often making it wiser to return nothing rather than risk showing the incorrect one.
Proxy-Pointer resolves this by swapping text-based fragmentation for a tree-structured approach. We don’t divide based on character limits; we divide along Sectional Boundaries. When a section includes 3 paragraphs and 2 images, no pieces extend beyond or spill into the next section. The LLM can evaluate each section as a fully independent unit and confidently assess the images within it.
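As a rough illustration (a simplified sketch, not the repository’s actual tree builder), splitting on Markdown headings instead of character counts keeps every paragraph and image reference inside its own section:

# Sketch: split a Markdown document along heading boundaries instead of character windows.
import re

def split_by_sections(markdown: str) -> list[dict]:
    sections, current = [], {"heading": "(preamble)", "body": []}
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):          # a new heading starts a new section
            sections.append(current)
            current = {"heading": line.strip("# ").strip(), "body": []}
        else:
            current["body"].append(line)          # paragraphs AND image refs stay together
    sections.append(current)
    return sections

# Every section is a self-contained unit: its text and its figure references
# are never split across neighbouring chunks.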
Let’s see how this works in real scenarios.
Prototype Configuration
I built a multimodal chatbot over 5 open research papers released under CC-BY licenses: CLIP, Nemobot, GaLore, VectorFusion, and VectorPainter. Adobe PDF Extract API was used for extracting content from the PDFs. As you might anticipate, these papers include dense text accompanied by a combined total of 270 images (illustrations, tables, formulas) that Adobe’s extraction was able to retrieve. The embedding model powering the system is gemini-embedding-001 (with dimensions trimmed down to 1536 from the default 3072, for faster searches and lower memory requirements). This is an exclusively text-based embedding model; no multimodal model was incorporated. For all LLM operations (noise filtering, re-ranking, response generation, and final visual validation), gemini-3.1-flash-lite-preview serves as the engine. The underlying vector index is FAISS.
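For orientation, here is a minimal sketch of that configuration: text-only Gemini embeddings truncated to 1,536 dimensions and stored in a flat FAISS index. This uses the google-genai SDK and is a simplified illustration, not the repository’s exact code.

# Sketch: build a text-only FAISS index with gemini-embedding-001 at 1536 dimensions.
import numpy as np
import faiss
from google import genai
from google.genai import types

client = genai.Client()
DIM = 1536  # trimmed from the default 3072 for faster search and a smaller index

def embed(texts: list[str]) -> np.ndarray:
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=texts,
        config=types.EmbedContentConfig(output_dimensionality=DIM),
    )
    vecs = np.array([e.values for e in result.embeddings], dtype="float32")
    faiss.normalize_L2(vecs)  # cosine similarity via inner product
    return vecs

index = faiss.IndexFlatIP(DIM)
index.add(embed(["Doc > Section > Subsection: breadcrumb-prefixed chunk text ..."]))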
Multimodal Proxy-Pointer Structure
In my prior in-depth explorations, I shared proof that Proxy-Pointer RAG could reach 100% accuracy on financial 10-K reports by indexing “Strategic Pointers” (navigation markers like `Financials > Item 1A > Risk Factors`) instead of raw text fragments.
For multimodal outputs, we adjust the pipeline with the following premise — images (diagrams, tables, formulas, clips, and similar elements) can be pulled out as separate files (.jpg, .png, .svg, .mp4, etc.) and kept alongside the document content. This is fairly straightforward when the source is a webpage or XML. For PDFs, while not flawless, a tool like Adobe PDF Extract API, used in this case, can successfully pull out tables and figures.
Within the extracted document (Markdown, in our case), each figure appears as a relative path embedded in the text, linking to the actual image file. Here’s a practical example:
Furthermore, inspired by the Tangram puzzle which forms different objects using a set of basic elements, as illustrated in Fig. 2(b), we reframe the synthesis task as a rearrangement of a set of strokes extracted from the reference image.

"The Starry Night"

"Self-Portrait"
This leads us to the pivotal insight that Proxy-Pointer leverages: In practice, the LLM doesn’t actually need to view the image to gauge its relevance. It simply needs to know that an image exists within a particular document section. Because Proxy-Pointer retrieves entire sections—rather than disjointed fragments—the LLM can leverage the complete section context to evaluate whether it’s relevant. This transforms image selection into a judgment call rooted in the section’s overall meaning and the query, rather than an exploratory search driven by multimodal similarity matching.
This mirrors how people actually read. Scanning every table and figure isn’t the approach—instead, readers rely on section context and their specific question to determine which visuals deserve attention.
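To make this concrete, the figure selection can be expressed as an ordinary text prompt. The sketch below is illustrative wording under that assumption, not the repository’s exact template; the fig_id and filename fields follow the node structure shown later.

# Sketch: let the LLM judge figures from full section context alone - no image pixels needed.
def build_selection_prompt(question: str, section_text: str, figures: list[dict]) -> str:
    figure_lines = "\n".join(
        f"- {f['fig_id']}: referenced in this section, file at {f['filename']}"
        for f in figures
    )
    return (
        f"Question: {question}\n\n"
        f"Full document section:\n{section_text}\n\n"
        f"Figures available in this section:\n{figure_lines}\n\n"
        "Answer the question from the section above. Then list ONLY the fig_ids that are "
        "directly relevant evidence for your answer; return an empty list if none are."
    )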
Below is the updated indexing pipeline:
Skeleton Tree: We continue to convert Markdown headings into a hierarchical structure using pure Python. The addition here is that each node now contains an embedded figures array that records every figure identified within that section, including its file path. These paths enable the system to fetch the corresponding images when needed. A sample node entry looks like this:
{
"title": "1 Introduction",
"node_id": "0003",
"line_num": 17,
"figures": [
{
"fig_id": "fig_1",
"filename": "figures/fileoutpart0.png"
},
{
"fig_id": "fig_2",
"filename": "tables/fileoutpart1.png"
}
]
},

The following four steps remain largely unchanged from the previous approach:
Breadcrumb Injection: Each chunk gets prefixed with its complete structural path (such as Galore > 3. Methodology > 3.1. Zero Convolution) before the embedding process begins.
Structure-Guided Chunking: Text is divided only at section boundaries, ensuring no section gets split across separate chunks.
Noise Filtering: Unwanted segments (table of contents, glossary, executive summaries, references) are filtered out from the index using a large language model.
Pointer-Based Context: Retrieved chunks serve as pointers to reconstruct the full, intact document section—now including embedded image references—for the synthesizer.
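To illustrate the figures array from step 1 above, here is a minimal sketch (a hypothetical helper, not the repository’s md_tree_builder.py verbatim) of how Markdown image references inside one section could be collected into a node entry:

# Sketch: collect Markdown image references within one section into a figures array.
import re

IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)]+)\)")  # matches ![alt](relative/path.png)

def extract_figures(section_text: str) -> list[dict]:
    figures = []
    for i, path in enumerate(IMAGE_REF.findall(section_text), start=1):
        figures.append({"fig_id": f"fig_{i}", "filename": path})
    return figures

# Example: a section containing two image references yields
# [{"fig_id": "fig_1", "filename": "figures/fileoutpart0.png"},
#  {"fig_id": "fig_2", "filename": "tables/fileoutpart1.png"}]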
Here is how the updated retrieval pipeline works for multimodal content:
Stage 1 (Broad Recall): FAISS identifies the top 200 chunks based on embedding similarity. Chunks sharing the same (doc_id, node_id) pair are merged, ensuring only distinct document sections appear in the final pool. This yields a refined set of approximately 50 candidate nodes. This process remains unchanged.
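A minimal sketch of that merge step (field names follow the (doc_id, node_id) convention above; the FAISS search itself is omitted):

# Sketch: collapse the top-200 chunk hits into unique (doc_id, node_id) sections,
# keeping the best-scoring chunk per section.
def merge_to_sections(hits: list[dict]) -> list[dict]:
    best: dict[tuple[str, str], dict] = {}
    for hit in hits:  # each hit: {"doc_id", "node_id", "score", ...}
        key = (hit["doc_id"], hit["node_id"])
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    # ~200 chunks typically collapse to ~50 distinct document sections
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)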
Stage 2 (Anchor-Aware Structural Re-Ranking): The re-ranker now receives each candidate’s full breadcrumb path along with a 150-character semantic snippet. This enhancement addresses a key challenge: unlike financial reports or technical manuals, academic papers frequently use vague, non-descriptive headings (such as ‘3. Experiments,’ ‘4. Optimization,’ or ‘5. Comparison’). A brief semantic hint is therefore essential for the LLM to determine which of these ambiguous sections actually contains the specific performance metrics or comparisons the query asks about.
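A rough sketch of how each candidate line handed to the re-ranker could be formatted (the layout and example are illustrative, not the exact template):

# Sketch: give the re-ranker a breadcrumb path plus a short semantic hint per candidate.
def format_candidate(doc_id: str, breadcrumb: str, section_text: str) -> str:
    snippet = " ".join(section_text.split())[:150]  # first ~150 characters of the section
    return f"[{doc_id}] {breadcrumb} :: {snippet}"

# Illustrative output:
# "[VectorFusion] VectorFusion > 4. Method > 4.2. Sampling vector graphics by optimization :: <first 150 chars of that section>"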
Here are a couple of sample queries and responses from the prototype.

Visual Question: How does the VectorFusion pipeline generate SVGs using latent score distillation?
The VectorFusion pipeline generates SVGs through three primary stages:
Raster Sampling: A raster image is sampled from a pretrained text-to-image diffusion model (e.g., Stable Diffusion) using a Runge-Kutta solver. To ensure iconographic aesthetics, the input text is appended with a prompt suffix
..contd
Sources:
• VectorFusion > 4. Method: VectorFusion
• VectorFusion > 4. Method: VectorFusion > 4.1. A baseline: text-to-image-to-vector
• VectorFusion > 4. Method: VectorFusion > 4.2. Sampling vector graphics by optimization
AI-Verified Evidence List:
• VectorFusion - Figure 3: VectorFusion generates SVGs in three stages. (VectorFusion/figures/fileoutpart29.png)
• VectorFusion - Figure 5: Overview of VectorFusion’s latent score distillation optimization procedure. (VectorFusion/figures/fileoutpart33.png)

Visual Question: What games are implemented in Nemobot, and how are they categorized according to Shannon’s game taxonomy?
Nemobot implements a variety of games categorized according to Shannon's game taxonomy, which classifies machines based on their decision-making logic:
Category 1: Dictionary-Based Machines
Tic Tac Toe: Uses a dictionary-based system where the LLM agent dynamically generates optimal moves for game states, effectively compressing the state-action dictionary into its inference capabilities.
..contd
Sources:
• NemoBot > II. GAME PLAYING WITH LARGE LANGUAGE MODELS > A. Shannon's Taxonomy of AI Games with LLMs
• NemoBot > IV. EVALUATION OF NEMOBOT GAMES > A. Dictionary-Based Game: Tic Tac Toe
• NemoBot > I. INTRODUCTION
AI-Verified Evidence List:
• NemoBot - Table I: Nemobot games categorized according to Shannon's game taxonomy (NemoBot/tables/fileoutpart5.png)
Edge Cases & Design Trade-offs
LLM Non-Determinism
Because an LLM handles image selection, running the same query multiple times might return slightly different images — even with temperature = 0.0. Based on your judgment, some results may seem more fitting than others.
Child-Node Figures
Targeted queries — such as "What is Vision Distillation Loss?" — tend to locate sections containing the exact concept and present the precise formula and figures far more effectively than broad queries like "Compare VectorFusion pipeline with VectorPainter". Broad queries usually pull in top-level section nodes, while the matching figures often live inside nested child nodes that fall outside the k=5 context window. Asking about each pipeline separately works well since all five available slots are devoted to a single paper, pulling in sufficient child nodes — and the associated figures — into context.
Detached Image Paths
This method assumes the image path referenced in the text actually exists within the retrieved section. If a figure is referenced in the text but stored in a separate section (for instance, an Appendix) that isn’t included in the results, it won’t appear. A workable fix is to use descriptive image filenames, such as `table_1.jpg` or `figure_3.png`, so the synthesizer can build the path from the reference instead of depending on generic extractor names like `fileoutpart1.png`. Whatever method is chosen, the underlying principle stays the same: no multimodal embedding or visual interpretation is required. Complete section context is enough for the LLM to make smart image selections.
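One way to apply that fix at extraction time is a small post-processing step that renames the extractor’s output files. This is a hypothetical sketch, not something the current repo does; the label list is assumed to come from the extraction order.

# Sketch: rename extractor output files to descriptive names such as figure_3.png,
# so the synthesizer can reconstruct paths from textual references like "Figure 3".
from pathlib import Path

def rename_figures(figure_dir: str, ordered_labels: list[str]) -> dict[str, str]:
    """ordered_labels like ["figure_1", "table_1", "figure_2"], in extraction order."""
    files = sorted(
        Path(figure_dir).glob("fileoutpart*.png"),
        key=lambda p: int(p.stem.removeprefix("fileoutpart")),  # sort by numeric suffix
    )
    mapping = {}
    for src, label in zip(files, ordered_labels):
        dst = src.with_name(f"{label}{src.suffix}")
        src.rename(dst)
        mapping[src.name] = dst.name
    return mapping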
Open-Source Repository
Proxy-Pointer is fully open-source (MIT License) and accessible through its GitHub repository. The multimodal pipeline is being added alongside the existing text-only version in the same repo.
It’s built for a 5-minute quickstart:
MultiModal/
├── src/
│ ├── config.py # Model selection (Gemini 3.1 Flash Lite)
│ ├── agent/
│ │ └── mm_rag_bot.py # MultiModal RAG Logic
│ ├── indexing/
│ │ ├── md_tree_builder.py # Structure Tree generator
│ │ └── build_md_index.py # Vector index builder
│ └── extraction/
│ └── extract_pdf.py # Adobe PDF extraction to Markdown logic
├── data/ # Unified Data Hub
│ ├── extracted_papers/ # Processed Markdown & Figures
│ └── pdf/ # Original Source PDFs
├── results/ # Benchmarking Hub
│ ├── test_log.json # 20-query results & metrics
│ └── test_queries.json # Benchmark questions
├── app.py # Streamlit Multimodal UI
└── run_test_suite.py # Automated benchmark runner

Key Takeaways
- Multimodal RAG is primarily a retrieval alignment challenge, not a vision problem. The core issue isn’t extracting or embedding images; it’s reliably linking them to the appropriate semantic context.
- Chunk-based retrieval fragments visual coherence. Sliding-window chunking breaks apart captions and separates images from their true semantic units, making dependable selection difficult.
- Multimodal embeddings add ambiguity rather than clarity. Visually similar elements, such as tables and diagrams, are hard to tell apart within a shared vector space, making relevance difficult to determine without structural anchors.
- Document structure is the missing piece. Treating documents as hierarchical semantic units lets images absorb meaning from their parent section, enabling confident selection.
- Proxy-Pointer reframes the challenge. Rather than searching for images directly, it retrieves sections and conditionally picks images based on full context, transforming a complex retrieval problem into a straightforward filtering task.
- Accuracy is more critical for visuals than for text. Displaying a wrong image can be more harmful than skipping one altogether, making precision essential for enterprise applications.
Conclusion
Multimodal responses have long been viewed as the next milestone in RAG system evolution. Yet, despite progress in vision models and multimodal embeddings, consistently returning relevant images alongside text remains an unresolved challenge.
The reason is subtle but foundational: conventional RAG pipelines work on fragmented chunks, while meaning — especially visual meaning — resides at the level of complete document structure. Without aligning retrieval to semantic units, even the most advanced models have difficulty making correct visual associations.
Proxy-Pointer MultiModal RAG tackles this gap by elevating the foundation from flat chunks to structured context. By pulling in complete sections and treating image paths as pointers to artifacts within them, it enables precise, scalable, and cost-effective multimodal responses, without depending on costly multimodal embeddings.
The outcome is a meaningful advancement: chatbots that don’t just explain, but show precise evidence — always anchored in the right context.
Clone the repo. Test it with your own documents. Let me know your thoughts.
Connect with me and share your feedback at www.linkedin.com/in/partha-sarkar-lets-talk-AI
All research papers referenced in this article — CLIP, Nemobot, GaLore, VectorFusion, and VectorPainter — are available under CC-BY license. Code and benchmark results are open-source under the MIT License. Images in this article were generated using Google Gemini.



