In Part 1 of this article, we set out to tackle one of the major challenges in document summarization: dealing with documents that are too large for a single API request. We also explored the pitfalls of the notorious ‘Lost in the Middle’ problem and demonstrated how clustering techniques like K-means can help structure and manage the information chunks effectively.
We divided the GitLab Employee Handbook into chunks and used an embedding model to convert those chunks of text into numerical representations called vectors.
Now, in the long-overdue (sorry!) Part 2, we will get to the meaty (no offense, vegetarians) stuff: playing with the clusters we created. With our clusters in place, we will focus on refining summaries so that no critical context is lost. This article will guide you through the next steps to transform raw clusters into actionable and coherent summaries, thereby enhancing existing Generative AI (GenAI) workflows to handle even the most demanding document summarization tasks!
A quick technical refresher
Okay, class! I’m going to concisely go over the technical steps we have taken so far in our solution approach:
- Files required:
A large document; in our case, we are using the GitLab Employee Handbook, which can be downloaded here.
- Tools required:
a. Programming language: Python
b. Packages: LangChain, LangChain Community, OpenAI, Matplotlib, Scikit-learn, NumPy, and Pandas
- Steps followed so far:
Text Preprocessing:
- Split the document into chunks to limit token usage while retaining semantic structure.
Feature Engineering:
- Applied an OpenAI embedding model to convert the document chunks into embedding vectors, retaining semantic and syntactic representation and allowing easier grouping of similar content for LLMs.
Clustering:
- Applied K-means clustering to the generated embeddings, grouping embeddings that share similar meanings. This reduced redundancy and supported accurate summarization.
A quick reminder note: for our experiment, the handbook was split into 1360 chunks; the total token count for those chunks came to 220035 tokens; the embeddings for each of those chunks produced a 1272-dimensional vector; and we initially set the number of clusters to 15.
Too technical? Think of it this way: you dumped an entire office’s archive on the floor. When you divide the pile of documents into folders, that’s chunking. Embedding attaches a unique “fingerprint” to those folders. And finally, when you sort those folders into different topics, say financial documents together and policy documentation together, that effort is clustering.
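For readers who prefer code over analogies, here is a minimal sketch of what that Part 1 pipeline roughly looks like. The file name, chunk size, and embedding model below are illustrative assumptions, not the exact values used in Part 1:
# Rough sketch of the Part 1 pipeline (file name, chunk size, and model
# choice below are illustrative assumptions, not the exact Part 1 settings)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OpenAIEmbeddings

with open("handbook.md", "r", encoding="utf-8") as f:  # hypothetical local copy of the handbook
    handbook_text = f.read()

# 1. Chunking: split the document while trying to respect semantic boundaries
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = text_splitter.create_documents([handbook_text])

# 2. Embedding: turn each chunk into a vector the clustering step can work with
embedding_model = OpenAIEmbeddings()  # assumes OPENAI_API_KEY is set in the environment
chunk_embeddings = embedding_model.embed_documents([d.page_content for d in splits])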
Class is resumed… welcome back from the holidays!
Now that we have all had a quick refresher (if it wasn’t detailed enough, you can check out Part 1, linked above!), let’s see what we will be doing with the clusters we got. But before that, let us look at the clusters themselves.
# Display the cluster labels in a tabular format
import pandas as pd
labels_df = pd.DataFrame(kmeans.labels_, columns=["Cluster_Label"])
labels_df['Cluster_Label'].value_counts()
In layman’s terms, this code simply counts how many chunks received each cluster label. That’s all. In other words, the code is asking: “after sorting all the pages into topic piles according to which cluster each page belongs to, how many pages are in each topic pile?” The size of each of these clusters is important to know, as large clusters indicate broad themes within the document, while small clusters may point to niche topics or content that appears in the document but not very often.
The Cluster Label Counts table shown above shows the distribution of the embedded text chunks across the 15 clusters formed by the K-means clustering process. Each cluster represents a group of semantically similar chunks. From the distribution we can see the dominant themes in the document and prioritize summarization efforts for larger clusters while not overlooking smaller or more niche ones. This ensures we don’t lose critical context during the summarization process.
Getting up close and personal
Let’s dive deeper into understanding our clusters, as they are the foundation of what will eventually become our summary. To do this, we will generate a few insights about the clusters themselves to understand their quality and distribution.
To perform our analysis, we need to apply what is known as dimensionality reduction. This is nothing more than reducing the number of dimensions of our embedding vectors. If the class remembers, we discussed how each vector can have multiple dimensions (values) to describe any given word or sentence, depending on the logic and math the embedding model follows (e.g., [2, 3, 5]). For our model, the produced vectors have a dimensionality of 1272, which is quite extensive and impossible to visualize (because humans can only see in three dimensions, i.e., 3D).
It’s like trying to make a rough floor plan of a huge warehouse full of boxes organized according to hundreds of subtle characteristics. The plan will not capture every detail of the warehouse and its contents, but it can still be immensely useful in figuring out which boxes tend to be grouped together.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from umap import UMAP
chunk_embeddings_array = np.array(chunk_embeddings)
num_clusters = 15
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(chunk_embeddings_array)
silhouette_avg = silhouette_score(chunk_embeddings_array, labels)
umap_model = UMAP(n_components=2, random_state=42)
reduced_data_umap = umap_model.fit_transform(chunk_embeddings_array)
cmap = plt.cm.get_cmap("tab20", num_clusters)
plt.figure(figsize=(12, 8))
for cluster in range(num_clusters):
    points = reduced_data_umap[labels == cluster]
    plt.scatter(
        points[:, 0],
        points[:, 1],
        s=28,
        alpha=0.85,
        color=cmap(cluster),
        label=f"Cluster {cluster}"
    )
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.title(f"UMAP Scatter Plot of Book Embeddings (Silhouette Score: {silhouette_avg:.3f})")
plt.legend(title="Cluster", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
The embeddings are first converted into a NumPy array (for processing efficiency). K-means then assigns a cluster label to each chunk, after which we calculate the silhouette score to estimate how well separated the clusters are. Finally, UMAP reduces the 1272-dimensional embeddings to two dimensions so we can plot each chunk as a colored point.
But… what is UMAP?
Imagine you run a huge bookstore and someone hands you a spreadsheet with 1,000 columns describing every book: genre, tone, pacing, sentence length, themes, reviews, vocabulary, and more. Technically, that is a very rich description. Practically, it is impossible to see. UMAP helps by squeezing all of that high-dimensional information down into a 2D or 3D map while trying to keep similar items near each other. In machine-learning terms, it is a dimensionality-reduction technique used for visualization and other forms of non-linear dimension reduction.
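As a side note, if you ever want to tweak how that map looks, UMAP exposes a couple of knobs worth knowing about. The values below are illustrative guesses, not the settings tuned for our handbook plot:
# Illustrative only: the UMAP parameters that most affect the resulting picture
# (values here are guesses, not the settings used for the handbook plot above)
from umap import UMAP

umap_sketch = UMAP(
    n_components=2,   # squeeze the 1272-dimensional embeddings down to a 2D map
    n_neighbors=15,   # smaller values emphasize local structure, larger values the global layout
    min_dist=0.1,     # how tightly points are allowed to pack together on the map
    random_state=42,
)
# reduced = umap_sketch.fit_transform(chunk_embeddings_array)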

So what are we actually looking at here? Each dot is a chunk of text from the handbook. Dots with the same color belong to the same cluster. When same-colored dots bunch together nicely, it means the cluster is reasonably coherent. When different colors overlap heavily, it tells us the document’s topics may bleed into one another, which is honestly not surprising for a real employee handbook that mixes policy, operations, governance, platform details, and all kinds of corporate life forms.
Some groups in the plot are fairly compact and visually separated, especially those out on the right side. Others overlap in the middle like attendees at a networking event who keep drifting between conversations. That is useful to know. It tells us the clusters are informative, but not magically perfect. And that, in turn, is exactly why we should treat clustering as a practical tool rather than a sacred revelation handed down by the algorithm gods.
But! What is a Silhouette Score?! And what does 0.056 mean?!
Good question, young Padawan; an answer you shall receive below.
Yeah, I’m not convinced by our clusters yet
Wow, what a tough crowd! But I like that; one should not trust the graphs just because they look good. Let’s dive into the numbers and evaluate these clusters.
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score
calinski_score = calinski_harabasz_score(chunk_embeddings_array, kmeans.labels_)
davies_score = davies_bouldin_score(chunk_embeddings_array, kmeans.labels_)
print(f"Calinski-Harabasz Score: {calinski_score}")
print(f"Davies-Bouldin Score: {davies_score}")
Calinski-Harabasz Score: 25.1835818236621
Davies-Bouldin Score: 3.566234372726926
Silhouette Score: 0.056
This one already appears in the UMAP plot title. I like to explain the silhouette score with a party analogy. Imagine every guest is supposed to stand with their own friend group. A high silhouette score means most people are standing close to their own group and far from everyone else. A low score means people are floating between circles, half-listening to two conversations, and generally causing social ambiguity. Here, 0.056 is low, which tells us the handbook’s topics overlap quite a bit. That isn’t ideal, but it is also not disqualifying. Real-world documents are messy, and useful clusters do not have to look like flawless textbook examples.
Calinski-Harabasz Score: 25.184 (rounded up)
This metric rewards clusters that are internally tight and well separated from each other. Think of a school cafeteria. If each friend group sits close together at its own table and the tables themselves are well spaced out, the cafeteria looks organized. That is the kind of pattern Calinski-Harabasz likes. In our case, the score gives us another signal that there is some structure in the data, even if it is not perfectly crisp.
Davies-Bouldin Score: 3.567 (rounded up)
The last metric measures the degree of overlap between clusters; the lower, the better. Let’s return to the school cafeteria from the previous example. If every table of students stuck to their own conversations, the din of the room would feel coherent. But if every table were also having conversations with other tables, and to different degrees, the room would feel chaotic. There is a catch here, though: for documents, especially large ones, it is important to maintain the context of information throughout the text. Our Davies-Bouldin score tells us there is meaningful overlap, but not so much that we lose a healthy separation of concerns.
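If you want to dig one level deeper than these three global numbers, a possible follow-up (just a sketch, reusing the arrays from the code above) is to look at per-cluster silhouette values and see which topic groups are coherent and which ones bleed into their neighbors:
# Sketch: average silhouette value per cluster, reusing chunk_embeddings_array
# and kmeans.labels_ from the plotting code above
from sklearn.metrics import silhouette_samples
import pandas as pd

sample_scores = silhouette_samples(chunk_embeddings_array, kmeans.labels_)
per_cluster = (
    pd.DataFrame({"cluster": kmeans.labels_, "silhouette": sample_scores})
    .groupby("cluster")["silhouette"]
    .mean()
    .sort_values()
)
print(per_cluster)  # clusters near the bottom overlap the most with their neighbors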
Well, hopefully three metrics with solid numbers backing them are good enough to convince us to move forward with confidence in our clustering approach.
It’s time to represent!
Now that we know the clusters are at least directionally useful, the next question is: how do we summarize them without summarizing all 1360 chunks one by one? The answer is to pick a representative example from each cluster.
# Find the embeddings closest to the centroids
# Create an empty list that will hold the closest points
closest_indices = []
# Loop through the number of clusters you have
for i in range(num_clusters):
    # Get the distances from this particular cluster center
    distances = np.linalg.norm(chunk_embeddings_array - kmeans.cluster_centers_[i], axis=1)
    # Find the position of the closest chunk (argmin returns the index of the smallest distance)
    closest_index = np.argmin(distances)
    # Append that position to the list of closest indices
    closest_indices.append(closest_index)
selected_indices = sorted(closest_indices)
selected_indices
Now here is where some mathematical magic happens. We know that each cluster is essentially a group of numbers, and that group has a center, also known in the world of calculus as the centroid. The centroid is essentially the center point of the cluster. We then measure how far each chunk is from this centroid; this is known as its Euclidean distance. The vector with the smallest Euclidean distance from its centroid is chosen from each cluster, giving us a set of vectors that best represent each cluster (most semantically).
This part works by pulling out the single most telling sheet from each stack of documents, kind of how one would pick the clearest face in a crowd. Rather than make the LLM go through all the pages, it gets handed just the standout examples first. Running this in the notebook gave back these specific chunk positions:
[110, 179, 222, 298, 422, 473, 642, 763, 983, 1037, 1057, 1217, 1221, 1294, 1322]
That means our next summarization stage works with fifteen strategically chosen chunks rather than all 1360. That is a serious reduction in effort without resorting to random guessing.
Can we start summarizing the document already?
Okay, yes, I apologize; it has been a bunch of math-bombing and not much document summarizing. But from here on, over the next few steps, we will focus on producing the most representative summaries for the document.
We plan to summarize each representative chunk on its own (since it is text at the end of the day). This is practically akin to a map-reduce style summarization flow, where we treat each selected chunk as a local unit, summarize it, and save the result.
from langchain.prompts import PromptTemplate
map_prompt = """
You will be given a single passage of a book. This section will be enclosed in triple backticks (```)
Your goal is to give a summary of this section so that a reader will have a full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what was said in the passage.
```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
There is nothing mystical going on here. We are simply telling the model, “Take one chunk at a time and explain it thoroughly.” That is much easier for the model than trying to reason over the entire handbook in one go. It’s the difference between asking someone to summarize one chapter they just read versus asking them to summarize a huge handbook they only skimmed while boarding a train.
from langchain.chains.summarize import load_summarize_chain
map_chain = load_summarize_chain(llm=llm3,
chain_type="stuff",
prompt=map_prompt_template)
selected_docs = [splits[doc] for doc in selected_indices]
# Make an empty list to hold your summaries
summary_list = []
# Loop through the selected docs
for i, doc in enumerate(selected_docs):
    # Go get a summary of the chunk
    chunk_summary = map_chain.run([doc])
    # Append that summary to your list
    summary_list.append(chunk_summary)
    print(f"Summary #{i} (chunk #{selected_indices[i]}) - Preview: {chunk_summary[:250]} \n")
This block of code wires the prompt into a summarization chain, grabs the 15 representative chunks, and then loops through them one by one. Each chunk is summarized on its own, and the result is appended to a list. In practice, this means we are creating 15 local summaries, each representing one major region of the document.

The notebook outputs can be a bit rough-looking, so I used my trusted GPT 5.4 to tidy them up for us! We can see that each of the representative chunks covers a broad range of the handbook’s major topics: harassment policy, stockholder meeting requirements, compensation committee governance, data team reporting, warehouse design, Airflow operations, Salesforce renewal processes, pricing structures, CEO shadow instructions, pre-sales expectations, demo systems infrastructure, and more. This kind of information extraction is exactly what we are aiming for. We are not just getting 15 random pages from the handbook; we are sampling the handbook’s major thematic spread.
Was it all worth it?
We will now ask the LLM to summarize these summaries into one rich overview. But before we get ahead of ourselves and pop the champagne, let’s check whether all the math and multi-summary generation has actually paid off in reducing memory and LLM context load. We take the 15 summaries, simply join them ad hoc (for now), convert the result back into a document object, and count the tokens.
from langchain.schema import Document
summaries = "\n".join(summary_list)
# Convert it back to a document
summaries = Document(page_content=summaries)
print(f"Your total summary has {llm.get_num_tokens(summaries.page_content)} tokens")
Your total summary has 4219 tokens
Success! This new intermediate document is far smaller than the source. The combined summary weighs in at 4219 tokens, which is a far cry from the original 220035-token beast. We’ve achieved a 98% reduction in context window token consumption!
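For the curious, the arithmetic behind that figure is simply 1 − 4219 / 220035 ≈ 0.981, i.e., the intermediate document consumes roughly 2% of the original token budget.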
This is the kind of optimization that makes an enterprise workflow practical. We didn’t pretend the original document is small; we built a compact proxy for it that still carries the major themes forward.
Singularity
Now we are ready for the final “reduce” step, converging all the summaries we have generated into the final, holistic document summary.
combine_prompt = """
You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks (```)
Your goal is to give a verbose summary of what happened in the story.
The reader should be able to grasp what happened in the book.
```{text}```
VERBOSE SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])
reduce_chain = load_summarize_chain(llm=llm4,
chain_type="stuff",
prompt=combine_prompt_template,
verbose=True  # Set this to True if you want to see the inner workings
)
output = reduce_chain.run([summaries])
print(output)
We start by creating a second summarization prompt and wiring it into a second summarization chain. The intermediate document created in the previous step is then fed in as the input to this chain. In simple terms, first we made the model understand each of the boroughs of NYC, and now we are asking it to understand NYC as a whole using those understandings.

As we can see, the final output does read well. It is clear and fairly easy to follow. But here is the slightly awkward part: the report leans much harder into the demo systems and Kubernetes parts of the handbook than into the full spread of topics we saw earlier. This does not mean the whole workflow collapsed and the experiment failed.
The smaller cluster summaries touched on governance, pricing, Salesforce, Airflow, Okta, customer engagement, and so on. By the time we reached the final combined summary, much of that had thinned out. So yes, the prose got cleaner, but the coverage also got narrower.
Why did this happen? And what can we do to improve on it? Let’s look at these questions in more depth.
Where did we go Right?
Enterprise documents are always messy. The topics within them overlap, the useful pieces of information can appear anywhere, and sending the whole thing in a single shot is both too expensive and a recipe for inaccuracies.
By clustering the split document chunks, picking a reasonably reliable representative from each cluster, and then using those representatives to summarize, we got something far more usable than brute-forcing the whole handbook through one prompt. The LLM is no longer walking around a minefield blind.
We were able to take a 220035-token handbook and reduce it to a manageable set of representative text chunks. The preview summaries covered a broad range of the handbook’s relevant themes.
The intermediate summary of the chunks shrank the problem again into something the model could actually work with. So even though the reducer butterfingered the final handoff a bit, the results before it show that clustering and representative-chunk selection make this problem far easier to handle reliably.
Where did we go Wrong?
Just as we recognize and acknowledge our strengths, we must also acknowledge our weaknesses. This method is not perfect, and its flaws are evident. The chunk-summary step preserved a diverse range of themes, but the final reduce-and-summarize step narrowed that diversity. Ironically, this led to a second round of the very problem we were trying to avoid: important information was lost during aggregation, even after it was preserved upstream.
Additionally, a single representative text chunk can miss nuances from its cluster. Overlapping clusters can blur topic boundaries. And the final synthesis step can fixate on the strongest or most detailed theme in the batch, as happened in this case. This does not render the workflow useless; it highlights the areas for improvement.
The next round of fixes should include a stronger reduction prompt that requires coverage across major themes, multiple representatives per cluster (increasing the number of centroids), and a final topical sanity check against the information spread observed in the previews.
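As a rough illustration of the second idea, picking several representatives per cluster is only a small change to the selection loop we used earlier; the choice of three representatives below is arbitrary, not a tuned value:
# Sketch: take the k closest chunks to each centroid instead of just one
# (k=3 here is an arbitrary illustration, not a tuned value)
k_per_cluster = 3
closest_indices = []
for i in range(num_clusters):
    distances = np.linalg.norm(chunk_embeddings_array - kmeans.cluster_centers_[i], axis=1)
    top_k = np.argsort(distances)[:k_per_cluster]  # indices of the k smallest distances
    closest_indices.extend(top_k.tolist())
selected_indices = sorted(set(closest_indices))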
If this workflow is used in domains where information loss is critical, such as medicine, legal review, or security, then validation of the final output is essential, and retrieval layers or a human-in-the-loop feedback step may be necessary.
“Useful” does not mean “infallible.” It means we now have a scalable system that is good enough to learn from and worth improving.
Class Dismissed, This Time for Real
Part 1 was about surviving the scale problem. Part 2 was about turning that survival strategy into an actual summarization pipeline. We started with 1360 chunks from a 220035-token handbook, grouped them into 15 clusters, visualized their structure, sanity-checked the grouping quality, picked representative chunks, summarized them individually, compressed those summaries into a 4219-token intermediate document, and then generated a final combined summary.
Clustering helps with the scale problem. Representative-chunk selection gives the workflow more structure. But the final summarization prompt still needs tuning for whole-document coverage. To me, that is the practical value of this experiment: it gives us something useful right now, and it also points quite clearly to what we should fix next.
So no, this is not a neat little project with a tidy ending. I think that’s better, honestly. We now have a summarization pipeline that works well enough to teach us something real: keeping breadth alive in the final aggregation step matters just as much as shrinking the document in the first place.

If you have made it this far, thank you again for reading and for tolerating my classroom metaphors. I hope this helped make large-document summarization feel a little less like pure AI magic and a little more buildable.



