In my piece on Taming Entity and Relationship Sprawl in Knowledge Graphs, I covered how the Proxy-Pointer design helps pinpoint the right entities and relationships quickly. Still, that tackles just the second half of a major hurdle in loading data into a graph. The trickier—and costlier—task is actually spotting those entities (NER) and relationships to begin with.
Knowledge Graphs are designed to handle layered queries and aggregation across entities and relationships found in similar documents—contracts, compliance guides, credit terms, global policies, and more. These files often stretch beyond 100 pages, with text that easily surpasses 500,000 characters. Companies routinely upload thousands of nearly identical contracts from vendors and clients.
To build the graph, each document goes through a powerful LLM for NER and relationship extraction, burning through millions of tokens before the graph-loading stage even begins. The whole process sometimes needs repeating since extracting from very long documents can lead to inconsistent results and fluctuating outputs.
Yet there’s a key detail: legal documents like contracts tend to follow a consistent layout regardless of the organization or industry. They are also loaded with repetitive boilerplate, schedules, and attachments, most of which offer little for entity recognition but still have to be processed by an LLM.
So, what if we could take advantage of this predictability? What if we could judge a section’s worth before sending it to the LLM and cut processing costs by simply skipping the noise?
In this article, we’ll look at a new way to limit what the LLM actually reads. By applying the Proxy-Pointer RAG framework and a new metric called Graphability Indexing, we can intentionally skip the low-value parts of dense documents. I’ll demonstrate this approach using three large, real-world corporate Credit Agreements—from Emerson, AT&T, and Texas Roadhouse—to show how this method significantly reduces extraction costs compared to whole-document processing, while still keeping the Knowledge Graph intact.
Quick Recap: Proxy-Pointer Fundamentals
Proxy-Pointer is an advanced RAG method that offers precise handling of intricate documents, such as annual reports and credit agreements, at no extra cost over standard Vector RAG. Traditional vector RAG breaks files into blind pieces, creates embeddings, and pulls the top matches by cosine similarity. Even with overlapping or smarter slicing, this isn’t reliable for relationship extraction in enterprise knowledge graphs because splitting breaks up context and increases the risk of the model making things up.
Proxy-Pointer, on the other hand, views a document as a hierarchy of self-contained semantic blocks (sections). Each block retains its own context, making it ideal for relationship extraction. An LLM is far more accurate at identifying entities and relationships from a focused section in one go than from an entire 100-page file, so repeated passes are usually unnecessary.
The technique uses five zero-cost engineering improvements—a skeleton tree of the document’s structure, breadcrumb tagging, structure-guided chunking, noise filtering, and pointer-based context. We’ll build on several of these ideas here, plus introduce a few new ones. You can find more about Proxy-Pointer in the linked article.
What Others Do to Optimize NER
Before diving into the Proxy-Pointer method, let’s review common optimization strategies used today.
- Classic NLP / Pre-Trained Tools (like spaCy): Many teams start with fast, inexpensive NLP pipelines such as spaCy paired with an LLM in a funnel setup. These tools are quick, trained to find typical entities (people, organizations, locations, dates), and used to flag entity-heavy regions. Then, only those areas get a closer scan by the LLM. But entity-packed sections don’t always mean important relationships. For example, boilerplate parts like ‘Notices’ or end ‘Exhibits’ may have plenty of standard entities (names, addresses, dates) without any useful legal ties. These models also struggle with specialized corporate terms (like Adjusted Term SOFR or Swing Line Loans) and can’t easily pull out the complex, layered relationships needed for strict legal Knowledge Graphs. Constant tweaking of these models also demands heavy manual labeling and processing power.
- LLM Pre-Screening with Smaller Models: Another option is using a cheaper LLM to skim chunks and judge if they contain worthwhile relationships, then passing only those high-value chunks to a larger reasoning model for detailed extraction. It’s less expensive per token, but you’re still making a model read every word of a 500,000-character document, meaning much of the document gets scanned twice in vain.
The Proxy-Pointer Method
As noted, Proxy-Pointer relies on these features of knowledge graphs:
- Graphs serve a specific domain and hold similar content. A procurement graph loads multiple supplier agreements (often duplicates from the same vendor), while a finance graph holds lending contracts, credit terms, and compliance files.
- These documents follow a common layout—sections, schedules, addenda—and only a portion holds valuable entities and relationships. The difficulty lies in pinpointing those parts.
We use this document predictability in these steps:
- Create and apply a basic Graphability index: Establish a reference index for a certain document type (like Credit Agreements). Sections are rated as very high, high, medium, low, or very low graphability. The score is based on Relational Density, the number of meaningful business links (edges) relative to the section’s length, rather than a raw count of entities (nodes). This prevents sections loaded with names and dates but lacking key relationships, such as Notices or Exhibits, from getting a high score. Using this method, “Payment of Obligations” gets a very high rank, while “Duties of Agent” or “Governing Law” are considered low value. There’s an important exception, though. Even though most sections are scored on relational density, core ontological parts like ‘Subsidiaries’ are tagged as ‘Very High’ because their few links define the crucial company hierarchy that governs the rest of the contract. This keeps the index useful as a business-focused guide, not just a technical count of entities.
- Structure tree creation: We create a structure tree for each document that lists the hierarchy of sections as nodes, along with each section’s title.
- Enrich and Adjust: We navigate through the tree rather than the text itself. We use the initial set of documents to refine and strengthen the index. Each section’s content is identified using line numbers. The section titles help locate the predicted yield index. Next, the LLM reviews all sections of the document, evaluating the actual yield index for every section based on extracted connections and entities. Any mismatches between the expected and real ratings are flagged for manual evaluation (for example, when the actual rating is “Low” but the index predicted “Medium”). The index classifications are updated according to feedback from human subject-matter experts.
- Route and Bypass: Once the above procedure is complete, we derive a thoroughly enhanced graphability index after reviewing several documents. From that point, high-yield sections (Very High, High, Medium) are sent to the LLM for thorough NER extraction. Low and Very Low sections are safely skipped.
- New Sections: Every document will include some sections absent from the index, which are flagged as Coverage Gaps. These require mandatory NER scans to avoid overlooking relevant connections. Once reviewed by human evaluators, commonly occurring sections can be added to the index, while distinctive sections such as “Benchmark Replacement Setting” can be disregarded.
- Reach stabilization: After only a handful of iterations, we anticipate prediction mismatches to decrease to nearly zero, and the number of “New Sections” to level off at around 20-25% (reflecting highly specialized or routine clauses), allowing the system to process vast document collections with a reliable balance of thoroughness and efficiency.
The graphability index should be maintained for each document type and may even need to be customized for specific large suppliers and partners, from whom we could receive hundreds of similar documents in a year.
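To make the structure-tree step concrete, here is a minimal sketch of how section nodes with line ranges might be collected. It is not the Proxy-Pointer implementation: the `HEADING` regex, the `SectionNode` fields, and the two-level article/section grouping are all assumptions for illustration, and real agreements will need more robust heading detection (for example, a body line that starts with "Section 2.01 of this Agreement" would be misread as a heading here).

```python
import re
from dataclasses import dataclass

# Assumed heading shape: "Section 2.01 The Facility" or "SECTION 2.12. Payments".
# This pattern is illustrative only; it is not taken from Proxy-Pointer.
HEADING = re.compile(r"^\s*section\s+(\d+\.\d+)\.?\s+(.+?)\s*$", re.IGNORECASE)


@dataclass
class SectionNode:
    node_id: str     # sequential id, e.g. "0001" (hypothetical format)
    article: str     # parent level, taken from the number before the dot
    number: str      # e.g. "2.01"
    title: str       # e.g. "The Facility"
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


def build_section_nodes(lines: list[str]) -> list[SectionNode]:
    """Scan lines once and record where each numbered section starts and ends."""
    nodes: list[SectionNode] = []
    for line_no, text in enumerate(lines, start=1):
        match = HEADING.match(text)
        if match is None:
            continue
        if nodes:
            # The previous section ends on the line before this heading.
            nodes[-1].end_line = line_no - 1
        number, title = match.group(1), match.group(2)
        nodes.append(
            SectionNode(
                node_id=f"{len(nodes) + 1:04d}",
                article=number.split(".")[0],
                number=number,
                title=title,
                start_line=line_no,
                end_line=len(lines),  # provisional; fixed when the next heading appears
            )
        )
    return nodes
```

Because each node carries only a title and a line range, later steps can rate or route a section without reading its text.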
Let’s now examine how this operates in a practical experiment.
The Experimental Setup
To validate this approach, I set up an experiment using three lengthy, publicly available Corporate Credit Agreements that I referenced earlier in my article on efficient Contract Comparison through Proxy-Pointer. These agreements originate from distinct companies (and industries), so their structure and formatting differ across documents.
- Emerson Electric Co. (~228,000 characters)
- AT&T Inc. (~214,000 characters)
- Texas Roadhouse, Inc. (TRoadhouse) (~434,000 characters)
Baseline Graphability Index
Our objective is to develop and iteratively confirm a predictive Graphability Index. We begin with a foundational baseline index mapping typical credit agreement sections to their anticipated relational density:
{
"document_type": "credit_agreement",
"very_high_graphability": [
"Litigation",
"Environmental Matters",
"Subsidiaries",
"Payment of Obligations",
"Maintenance of Property",
"Mergers and Sales of Assets",
"Commitment Schedule",
"Sanctions and Anti-Corruption",
"Designation of Subsidiary Borrowers",
"Definitions",
"Events of Default",
"Successors and Assigns"
],
"high_graphability": [
"Company Guarantee",
"The Facility",
"Facility Letters of Credit",
"Corporate Existence and Power",
"Corporate Authorization",
"Financial Information",
"Compliance with Laws",
"Use of Proceeds",
"Arranger and Syndication Agent",
"Eurocurrency Payment Offices",
"Defaulting Lenders"
],
"medium_graphability": [
"Swing Line Loans",
"Competitive Bid Advances",
"Credit Extensions",
"Designation of a Subsidiary Borrower",
"Successor Agent",
"Funding Indemnification",
"Acceleration and Collateral Accounts",
"Collateral"
],
"low_graphability": [
"Accounting Terms",
"Interest Rate Changes",
"Method of Payment",
"Telephonic Notices",
"Market Disruption",
"Judgment Currency",
"Change in Circumstances",
"Confidentiality"
],
"very_low_graphability": [
"No Waivers",
"Counterparts and Integration",
"Governing Law",
"Waiver of Jury Trial",
"No Fiduciary Duty",
"Service of Process",
"Miscellaneous",
"Electronic Communications",
"Exhibit",
"Table of Contents"
]
}

The process is divided into three phases. First, the Emerson agreement is processed to determine the initial savings. Any uncovered general sections (deltas) identified in Emerson are incorporated into the index. Then the updated index is applied to AT&T, with any final edge cases added to the index if needed. Finally, the fully refined index is tested against the large TRoadhouse agreement to measure the total improvement. The objective is that by the time the TRoadhouse agreement is reviewed, mismatches should be considerably lower than in the prior two, as the index stabilizes.
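An index shaped like the JSON above can be inverted into a title-to-tier lookup and used for the Route and Bypass decision. The sketch below is one possible reading, not the project's code: the function names are hypothetical, and matching a section header by case-insensitive substring (longest index title wins) is my assumption about how titles are resolved.

```python
from typing import Optional

# Tiers that are sent to the LLM for full extraction, per the routing rule above.
EXTRACT_TIERS = {"very_high", "high", "medium"}


def invert_index(index: dict) -> dict[str, str]:
    """Map each lower-cased index title to its tier, e.g. 'the facility' -> 'high'."""
    lookup: dict[str, str] = {}
    for key, titles in index.items():
        if key.endswith("_graphability"):  # skips keys such as "document_type"
            tier = key.removesuffix("_graphability")  # Python 3.9+
            for title in titles:
                lookup[title.lower()] = tier
    return lookup


def predict_tier(section_header: str, lookup: dict[str, str]) -> Optional[str]:
    """Return the tier of the longest index title found in the header, if any.

    Substring matching is an assumption made for this sketch.
    """
    header = section_header.lower()
    hits = [title for title in lookup if title in header]
    if not hits:
        return None
    return lookup[max(hits, key=len)]


def route(section_header: str, lookup: dict[str, str]) -> str:
    """Decide whether a section is extracted, skipped, or scanned as a coverage gap."""
    tier = predict_tier(section_header, lookup)
    if tier is None:
        return "extract_gap"  # unseen sections still get a mandatory NER scan
    return "extract" if tier in EXTRACT_TIERS else "skip"
```

With the baseline index loaded, `route("Section 2.01 The Facility", lookup)` would return `"extract"`, while a header with no index match is routed to a mandatory scan rather than skipped.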
Evaluation Criteria
For every section, we compare the index-predicted graphability with the actual rating determined by the LLM based on the relations and entities found. In the report, results are organized into three groups:
- Perfect Alignment: The index accurately forecast the section’s graphability rating.
- Minor Deviations: The index predicted a rating (e.g., Medium) that differed slightly from the actual rating (e.g., Low).
- Coverage Gaps / New Sections: The section was unique to the document and not yet included in our predictive index.
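The three groups can be expressed as a small comparison function. This is a sketch under assumptions: the label strings and the `tier_distance` helper are hypothetical names, and a predicted rating of `None` stands for a section that has no entry in the index.

```python
from collections import Counter
from typing import Optional

# Ratings ordered from lowest to highest yield.
TIERS = ["Very Low", "Low", "Medium", "High", "Very High"]


def tier_distance(predicted: str, actual: str) -> int:
    """Number of tiers separating two ratings (0 means identical)."""
    return abs(TIERS.index(predicted) - TIERS.index(actual))


def match_quality(predicted: Optional[str], actual: str) -> str:
    """Classify one section into the report's three groups."""
    if predicted is None:
        return "coverage_gap"       # section title not found in the index
    if predicted == actual:
        return "perfect_alignment"
    return "deviation"              # flagged for manual review


def summarize(rows: list[tuple[Optional[str], str]]) -> Counter:
    """Count sections per group from (predicted, actual) pairs."""
    return Counter(match_quality(predicted, actual) for predicted, actual in rows)
```

Keeping the distance separate from the label lets a reviewer sort deviations by how far off the prediction was.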
Results & Iterative Enrichment
Let’s proceed with Phase 1 — Emerson
Phase 1: Emerson Credit Agreement (Testing the Baseline)
We processed the 95 sections of this agreement using the baseline index. In this initial run, 66 out of 95 sections (69.5%) matched perfectly. The index accurately mapped standard provisions like “Mergers and Sales of Assets” as highly graphable, while correctly labeling “Accounting Terms” and standard boilerplate Exhibits as low-yield. No mismatches occurred between actual and predicted ratings from the index.
However, 29 sections (~30%) were flagged as New Sections, identified as Coverage Gaps. Upon review, while many were highly specialized administrative clauses (such as “Ratable advances” and “Notification of advances”) and appropriately left as gaps, several standard sections (like “Types of Advances,” “Compliance with ERISA,” and “Interest Payment Dates; Interest and Fee Basis”) needed to be added to the index. Based on their assessed actual yield, I added these specific clauses to the “Medium” and “Low” tiers of the graphability index, enriching the baseline for the subsequent phase.
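The enrichment step amounts to merging reviewer-approved titles into the right tier. Below is a minimal sketch, assuming the index keeps the JSON layout shown earlier; `enrich_index` and the shape of the `approved` mapping are hypothetical, and the single example entry uses the “Low” actual rating reported for “Types of Advances” in the sample rows.

```python
import copy


def enrich_index(index: dict, approved: dict[str, str]) -> dict:
    """Return a copy of the index with reviewer-approved titles added.

    `approved` maps a section title to a tier name such as "low" or "medium".
    Titles already present in any tier are left where they are.
    """
    updated = copy.deepcopy(index)
    known = {
        title.lower()
        for key, titles in updated.items()
        if key.endswith("_graphability")
        for title in titles
    }
    for title, tier in approved.items():
        key = f"{tier}_graphability"
        if key not in updated:
            raise ValueError(f"unknown tier: {tier!r}")
        if title.lower() not in known:
            updated[key].append(title)
            known.add(title.lower())
    return updated


# Example: one delta from the Emerson review (rating taken from the sample table).
approved_after_review = {"Types of Advances": "low"}
```

Returning a copy keeps the previous index version available, which helps when comparing predictions before and after an enrichment round.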
A key finding is that even with this initial baseline index, 36,880 characters of text classified as “Low” or “Very Low” yield were correctly identified as noise by the index. As a result, skipping these sections and not sending them to the LLM for processing can lead to a 16.10% decrease in the overall LLM processing workload.
The following data highlights the match quality and the efficiency of yield prediction:
| Matched Ratings | Number of Sections | Total Characters | % of Total Document |
|---|---|---|---|
| Very High | 13 | 61,360 | 26.79% |
| High | 13 | 83,040 | 36.26% |
| Medium | 17 | 27,840 | 12.16% |
| Low | 15 | 12,800 | 5.59% |
| Very Low | 8 | 24,080 | 10.51% |
| Mismatched Rating | 0 | 0 | 0.00% |
| New Section | 29 | 19,920 | 8.70% |
| TOTAL | 95 | 229,040 | 100.00% |
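The savings figure follows directly from the table: it is the share of characters in the Low and Very Low rows. A short check, using only the numbers above (the function name `skip_share` is mine):

```python
# Character counts per bucket, copied from the Emerson table above.
emerson_chars = {
    "Very High": 61_360,
    "High": 83_040,
    "Medium": 27_840,
    "Low": 12_800,
    "Very Low": 24_080,
    "Mismatched Rating": 0,
    "New Section": 19_920,
}


def skip_share(chars_by_bucket: dict[str, int], skip=("Low", "Very Low")):
    """Return (skipped characters, total characters, skipped fraction)."""
    total = sum(chars_by_bucket.values())
    skipped = sum(chars_by_bucket[bucket] for bucket in skip)
    return skipped, total, skipped / total


skipped, total, share = skip_share(emerson_chars)
print(f"{skipped:,} of {total:,} characters skipped ({share:.2%})")
# prints: 36,880 of 229,040 characters skipped (16.10%)
```

New Sections are deliberately left out of the skipped set, since coverage gaps still receive a mandatory scan.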
Here is a small sample of rows from the base table used for section-by-section comparison:
| Node ID | Section Header | Approx. Chars | Entities (Est.) | Relations (Est.) | Actual Rating | Predicted Rating (Index Match) | Match Quality |
|---|---|---|---|---|---|---|---|
| 0002 | Section 1.01 Definitions | 44,400 | 252 | 402 | Very High | Very High (Definitions) | 🟢 |
| 0003 | Section 1.02 Accounting Terms and Determinations | 320 | 4 | 4 | Low | Low (Accounting Terms) | 🟢 |
| 0004 | Section 1.03 Types of Advances | 800 | 19 | 2 | Low | New Section | ⚪ |
| 0006 | Section 2.01 The Facility | 2,320 | 27 | 21 | High | High (The Facility) | 🟢 |
| 0007 | Section 2.02 Ratable Advances | 3,840 | 56 | 19 | Very High | New Section | ⚪ |

The following are a few examples of extractions:
- **Company Guarantee (Very High)**:
  - *Entities*: Guarantor, Agent, Obligations
  - *Relations*: [Guarantor]-(guarantees)->[Obligations], [Guarantor]-(indemnifies)->[Agent]
- **Mergers and Sales of Assets (Very High)**:
  - *Entities*: Borrower, Assets, Buyer
  - *Relations*: [Borrower]-(sells)->[Assets], [Borrower]-(merges_with)->[Buyer]
- **Ratable Advances (Very High)**:
  - *Entities*: Advance, Lender, Borrower
  - *Relations*: [Lender]-(makes)->[Advance], [Borrower]-(receives)->[Advance]
- **Method of Payment (Low)**:
  - *Entities*: Agent, Accounts, Funds
  - *Relations*: None (section contains purely administrative procedures with minimal relevant relational links)

Phase 2: AT&T Credit Agreement (Refinement Stage)
Next, we applied the enhanced index to the AT&T Credit Agreement. The document consists of 77 sections with approximately 214,000 characters in total.
The alignment rate held steady while coverage improved. 55 out of 77 sections (71.4%) achieved Perfect Alignment, nearly identical to the Emerson results. Four sections (about 5%) were mismatched, meaning their actual and predicted graphability ratings differed. To avoid overfitting the index to individual documents, I made no adjustments for these. Coverage Gaps fell to 18 sections (23.4%), down from Emerson’s 30%. All such sections were judged to be procedural noise from a knowledge graph perspective; examples include calculations of time periods, extensions of termination dates, or subordination clauses. These are considered low-yield segments in NER terms and should be excluded from future LLM scanning. However, to verify the robustness of the current index, they were not added to it before testing against the TRoadhouse document.
The potential savings in LLM usage grew significantly. Since the index could reliably detect extensive parts of the document as low-yield content—such as interest rate calculations, increased costs clauses, along with Table of Contents and trailing Exhibits—the system marked 72,763 characters as unnecessary for scanning. Implementing this approach in production could deliver a 33.94% reduction in processing requirements, while still ensuring all high-value relational information is captured.
The match quality and efficiency results are summarized below:
| Matched Ratings | Number of Sections | Total Characters | % of Total Document |
|---|---|---|---|
| Very High | 5 | 53,520 | 24.96% |
| High | 9 | 41,840 | 19.51% |
| Medium | 15 | 20,000 | 9.33% |
| Low | 12 | 10,960 | 5.11% |
| Very Low | 14 | 61,803 | 28.83% |
| Mismatched Rating | 4 | 4,880 | 2.28% |
| New Section | 18 | 21,397 | 9.98% |
| TOTAL | 77 | 214,400 | 100.00% |
A small excerpt from the section rating analysis table is shown below:
| Node ID | Section Header | Approx. Chars | Entities (Est.) | Relations (Est.) | Actual Rating | Predicted Rating (Index Match) | Match Quality |
|---|---|---|---|---|---|---|---|
| 0017 | SECTION 2.12. Payments and Computations | 1,520 | 21 | 5 | Low | Low (Payments and Computations) | 🟢 |
| 0018 | SECTION 2.13. Taxes | 3,360 | 14 | 10 | Medium | Medium (Taxes) | 🟢 |
| 0019 | SECTION 2.14. Sharing of Payments, Etc. | 800 | 8 | 6 | Low | Low (Sharing of Payments) | 🟢 |
| 0020 | SECTION 2.15. Evidence of Debt | 640 | 10 | 2 | Low | Low (Evidence of Debt) | 🟢 |
| 0021 | SECTION 2.16. Use of Proceeds | 320 | 8 | 4 | High | High (Use of Proceeds) | 🟢 |
| 0022 | SECTION 2.17. Increase in the Aggregate Commitments | 2,800 | 22 | 9 | Medium | New Section | ⚪ |
| 0023 | SECTION 2.18. Extension of Termination Date | 3,120 | 20 | 25 | Medium | New Section | ⚪ |
| 0024 | SECTION 2.20. Replacement of Lenders | 1,920 | 19 | 12 | Medium | Medium (Replacement of Lenders) | 🟢 |
| 0025 | SECTION 2.21. Benchmark Replacement Setting | 12,560 | 61 | 31 | High | High (Benchmark Replacement Setting) | 🟢 |

Here is a selection of extraction examples:
- **Certain Defined Terms (Very High)**:
  - *Entities*: Base Rate, Margin, SOFR
  - *Relations*: IS_A, PART_OF, CONTROLS, ROLE_OF, REFERENCES (defining these terms establishes the core ontology, supports standard entity normalization, and enhances semantic structure)
- **Conditions Precedent (Medium)**:
  - *Entities*: Closing Date, Certificates, Approvals
  - *Relations*: [Lender]-(requires)->[Certificates], [Agent]-(receives)->[Approvals]
- **Accounting Terms; Interpretive Provisions (Low)**:
  - *Entities*: GAAP, Accounting Principles
  - *Relations*: None (this section consists entirely of administrative and interpretive content, with little to no significant relational data)

Phase 3: TRoadhouse Credit Agreement (Final Validation)
Although only the first document was used to build and refine the graphability index, let’s now validate it against the TRoadhouse credit agreement.
Before proceeding, it’s important to consider several differences, not just between the documents themselves, but also across the domain and industry. Emerson and AT&T are large, blue-chip industrial and telecommunications companies, while Texas Roadhouse is a mid-sized restaurant chain. The agreements for Emerson and AT&T read like sovereign corporate treasury documents shaped by credit agency ratings, whereas Texas Roadhouse’s agreement is heavily customized and built specifically around restaurant lease arrangements. In terms of size, this document contains around 434,000 characters, making it nearly as large as the previous two combined, with more than 100 sections in its structure tree.
Put simply, if the Graphability Index performs well on this document, the idea that document structure can reliably predict the yield of entities and relationships will have strong support.
The index performed well. 81 out of 102 sections (79.4%) matched the index’s predictions exactly, and there were zero cases where the actual rating diverged from the predicted one. The index correctly identified high-yield sections such as “Letters of Credit” and standard “Affirmative and Negative Covenants,” which should trigger full extraction. The remaining 21 sections (20.6%), categorized as coverage gaps, consisted of low-yield administrative clauses (e.g., Rounding, Erroneous Payments) and procedural noise (e.g., Divisions, Commitments).
However, the real value emerged in payload efficiency. Several additional low-yield sections — including Accounting Terms, Rounding, Administrative Agent, and Miscellaneous — were identified beyond the Exhibits. The Schedules were evaluated individually based on their content value. While certain schedules like Liens and Investments matched the index’s “High” rating, others such as Existing LCs were classified as gaps.
Following the predictions and skipping the combined Low and Very Low categories entirely (164,520 of 434,640 characters) translates to a net saving of about 38%. This supports the practical viability of the approach.
Below is the yield processing efficiency table:
| Matched Ratings | Number of Sections | Total Characters | % of Total Document |
|---|---|---|---|
| Very High | 11 | 128,840 | 29.64% |
| High | 12 | 30,320 | 6.98% |
| Medium | 20 | 25,000 | 5.75% |
| Low | 17 | 9,520 | 2.19% |
| Very Low | 21 | 155,000 | 35.66% |
| Mismatched Rating | 0 | 0 | 0.00% |
| New Section | 21 | 85,960 | 19.78% |
| TOTAL | 102 | 434,640 | 100.00% |
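To compare the three phases on section counts rather than characters, the matched, mismatched, and new-section counts from the three tables can be turned into rates. This is only a recomputation of the reported numbers; the dictionary layout and the `rates` helper are my own.

```python
# Section counts per phase, taken from the three results tables.
# "matched" is the sum of the Very High through Very Low rows.
phases = {
    "Emerson": {"matched": 66, "mismatched": 0, "new": 29},
    "AT&T": {"matched": 55, "mismatched": 4, "new": 18},
    "Texas Roadhouse": {"matched": 81, "mismatched": 0, "new": 21},
}


def rates(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw section counts into fractions of the document's sections."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for name, counts in phases.items():
    r = rates(counts)
    print(
        f"{name}: matched {r['matched']:.1%}, "
        f"mismatched {r['mismatched']:.1%}, new {r['new']:.1%}"
    )
```

Tracking these rates after each new document is one simple way to judge whether the index is approaching the stabilization point described earlier.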
Here are some examples of section ratings:
| Node ID | Section Header | Approx. Chars | Entities (Est.) | Relations (Est.) | Actual Rating | Predicted Rating (Index Match) | Match Quality |
|---|---|---|---|---|---|---|---|
| 0104 | 7.14 Financial Covenants | 720 | 12 | 1 | Very High | Very High (Financial Covenant) | 🟢 |
| 0105 | 8.01 Events of Default | 3,200 | 30 | 21 | Medium | Medium (Events of Default) | 🟢 |
| 0108 | Article 9: ADMINISTRATIVE AGENT (Aggregated) | 4,880 | 2 | 0 | Low | Low (Duties of Agent) | 🟢 |
| 0119 | Article 10: MISCELLANEOUS (Aggregated) | 18,000 | 2 | 0 | Very Low | Very Low (Miscellaneous) | 🟢 |
| 0144 | Schedule 2.01A Commitments | 4,000 | 2 | 0 | Very High | Very High (Commitment Schedule) | 🟢 |
| 0145 | Schedule 2.01B L/C Commitments | 2,000 | 2 | 0 | Very Low | New Section | ⚪ |
| 0146 | Schedule 2.03 Existing L/Cs | 3,000 | 3 | 0 | Very Low | New Section | ⚪ |
| 0147 | Schedule 5.01 Jurisdictions | 6,000 | 2 | 0 | Very Low | New Section | ⚪ |
| 0159 | Schedule 5.06 Litigation | 5,000 | 2 | 5 | Very High | Very High (Litigation) | 🟢 |
| 0161 | Schedule 5.09 Environmental | 8,000 | 2 | 5 | Very High | Very High (Environmental Matters) | 🟢 |
| 0163 | Schedule 5.13 Subsidiaries | 40,000 | 2 | 5 | Very High | Very High (Subsidiaries) | 🟢 |

And here are a few examples of extractions:
- **Financial Covenants (Very High)**:
  - *Entities*: Borrower, Leverage Ratio, Fixed Charge Coverage Ratio
  - *Relations*: [Borrower]-(maintains)->[Leverage Ratio]
- **Investments & Liens (High)**:
  - *Entities*: Borrower, Lien, Property, Permitted Investments
  - *Relations*: [Borrower]-(grants)->[Lien], [Borrower]-(makes)->[Permitted Investments]
- **Defined Terms (Very High)**:
  - *Entities*: Adjusted Term SOFR, Base Rate, Defaulting Lender
  - *Relations*: IS_A, PART_OF, CONTROLS, ROLE_OF, REFERENCES (Definitions form the ontology backbone, creating canonical entity normalization and robust semantic inheritance)

Conclusion
Today’s Knowledge Graph pipelines are fundamentally inefficient. We force costly LLMs to process entire enterprise corpora, even though only a small fraction of those documents contain meaningful relational intelligence.
This article demonstrated that document structure alone can act as a reliable predictor of graph extraction yield.
By integrating Proxy-Pointer’s structural analysis with Graphability Indexing, we can transition Knowledge Graph ingestion from brute-force semantic scanning to targeted structural routing. Rather than repeatedly processing full 500,000-character agreements, the system identifies which regions of a document family consistently produce valuable entities and relationships — and which are mostly boilerplate noise. We can simply bypass the noise entirely, without resorting to workarounds like smaller LLMs to cut costs.
Across three large real-world credit agreements spanning different industries, the index stabilized quickly after just a few iterations and consistently achieved significant payload reductions while maintaining high-value relational extraction.
More importantly, this calls for a shift in how we think about extraction architecture. Rather than treating documents as flat text streams, Proxy-Pointer treats them as structured semantic trees capable of predicting where meaningful knowledge is likely to reside before extraction even begins.
As enterprise GraphRAG systems scale across millions of contracts, filings, policies, and agreements, this kind of structure-aware ingestion could play a key role in making large-scale Knowledge Graph construction operationally sustainable.
Open-Source Repository
Proxy-Pointer is fully open-source (MIT License) and available at the Proxy-Pointer GitHub repository. You can install it with a single pip command.
Clone the repo. Test it on your own documents. I’d love to hear your feedback.
Connect with me and share your thoughts at www.linkedin.com/in/partha-sarkar-lets-talk-AI
The credit agreements referenced here are publicly available at SEC.gov. Code and benchmark results are open-source under the MIT License. Images in this article were generated using Google Gemini.



