Semantic Clustering Of Unstructured Text Using Large Language Model Embeddings And Density-Based Algorithms

Throughout this guide, you’ll explore how to create a text clustering system by merging large language model embeddings with HDBSCAN—a density-based clustering method—to automatically identify topics within unlabeled text collections.

Key areas we’ll address:

How to produce text embeddings from raw documents using a pre-trained sentence-transformers model.
How to lower the dimensionality of those embeddings using UMAP to get them ready for clustering.
How to leverage HDBSCAN to automatically detect topic clusters and display the outcomes visually.

Clustering Unstructured Text with LLM Embeddings and HDBSCAN

Overview

The present wave of Generative AI tends to revolve around chat interfaces and prompt engineering, yet the usefulness of large language models, commonly referred to as LLMs, extends well beyond these areas. In fact, one of their most impactful downstream capabilities is converting raw, unstructured, and often messy text into meaningful numerical representations known as embeddings. Once we have these representations, we can apply them to numerous machine learning tasks—clustering being a prime example.

Specifically, embeddings can be paired with sophisticated, density-based clustering algorithms such as HDBSCAN, which enables the uncovering of latent topics, patterns, or categories within your text document collection—all without requiring any pre-existing labels.

This article walks you through building a text clustering pipeline from the ground up. We’ll work with an openly accessible dataset of text samples, alongside an open-source LLM fine-tuned for embedding generation—commonly referred to as an embedding model. As a bonus, we’ll take advantage of free, modern Python libraries that offer ready-made implementations of clustering algorithms like HDBSCAN.

Detailed Walkthrough

To begin, let’s set up the essential Python libraries we’ll be relying on:

sentence-transformers, for loading a pre-trained embedding model from Hugging Face. The model used here is public, so a Hugging Face access token should not be needed to download it, although you can configure one if your environment asks for it.
umap-learn, for performing dimensionality reduction on the embeddings.

Additionally, if you’re running things in a local IDE rather than a cloud-based notebook, you may need to install scikit-learn and pandas as well. The plotting code later on also uses matplotlib and seaborn, and one snippet uses the standalone hdbscan package.

!pip install sentence-transformers umap-learn
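If you are working locally, it can help to confirm which libraries are already present before installing anything. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical, and the list holds import names (which differ from the pip names for sentence-transformers, umap-learn, and scikit-learn).

```python
import importlib.util


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# Import names (not pip names) of the libraries this walkthrough relies on.
required = ["sentence_transformers", "umap", "sklearn", "pandas"]
print("Missing:", missing_packages(required) or "none")
```

Anything reported as missing can then be installed with pip as shown above.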

Now we move on to the code, starting with the data. scikit-learn’s fetch_20newsgroups function, which retrieves a collection of categorized newsgroup posts, will handle this. We restrict it to three categories (sci.space, sci.med, and rec.autos). Although the dataset comes with labels, we won’t pass them to the clustering algorithm, since the goal is to group these documents by similarity as if we had no prior knowledge of their categories. We also drop very short posts and trim the dataset down to 150 samples, which is sufficient for demonstration purposes.

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Fetching a targeted subset of the data: three categories only
categories = ['sci.space', 'sci.med', 'rec.autos']
newsgroups = fetch_20newsgroups(
    subset="train",
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
)

# Keeping posts longer than 100 characters, then sampling 150 of them
df = pd.DataFrame({'text': newsgroups.data, 'true_label': newsgroups.target})
df = df[df['text'].str.strip().str.len() > 100].sample(150, random_state=42).reset_index(drop=True)

print(f"Loaded {len(df)} text documents.")
print("\nSample document:")
print(df['text'].iloc[0][:150] + "...")

The next step is to derive embeddings from the raw text data. For this purpose, we load the all-MiniLM-L6-v2 model from Hugging Face’s sentence-transformers library. This is a lightweight yet effective model for generating embeddings quickly.

from sentence_transformers import SentenceTransformer

# Loading the free, open-source model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encoding text documents into dense vector embeddings
print("Generating vector representations...")
embeddings = model.encode(df['text'].tolist(), show_progress_bar=True)
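What makes these vectors useful for clustering is that documents with similar meaning should end up pointing in similar directions. The sketch below illustrates that idea with cosine similarity. The three 4-dimensional vectors are made up for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for real document embeddings.
space_doc_1 = [0.9, 0.1, 0.0, 0.2]
space_doc_2 = [0.8, 0.2, 0.1, 0.1]
autos_doc = [0.1, 0.9, 0.3, 0.0]

print("space vs space:", round(cosine_similarity(space_doc_1, space_doc_2), 3))
print("space vs autos:", round(cosine_similarity(space_doc_1, autos_doc), 3))
```

With the real data, the same function could be applied to any two rows of the embeddings array to compare a pair of documents.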

With the embeddings in hand, we now turn to dimensionality reduction. We employ UMAP (Uniform Manifold Approximation and Projection) to project the high-dimensional vectors into a lower-dimensional space. This makes it possible to visualize the data and can enhance clustering performance.

import umap

# Reducing the embeddings to a lower-dimensional space for visualization and clustering
reducer = umap.UMAP(n_components=2, random_state=42, metric='cosine')
embeddings_2d = reducer.fit_transform(embeddings)

Next, we apply HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to identify clusters within the reduced embeddings. Unlike some other clustering methods, HDBSCAN does not require specifying the number of clusters beforehand and can detect outliers automatically.

# Requires the standalone hdbscan package (pip install hdbscan)
import hdbscan

# Clustering the reduced embeddings using HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric='euclidean')
cluster_labels = clusterer.fit_predict(embeddings_2d)
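HDBSCAN reports outliers by assigning them the label -1, while real clusters are numbered from 0 upward. A small helper can turn a label array into a quick summary. This is a minimal sketch with a hypothetical function name and made-up labels, not output from the run above.

```python
from collections import Counter


def summarize_labels(labels):
    """Count clusters and noise points in an HDBSCAN-style label sequence."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # -1 marks points not assigned to any cluster
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "noise_fraction": n_noise / len(labels),
        "cluster_sizes": dict(sorted(counts.items())),
    }


# Made-up labels for ten documents.
example_labels = [0, 0, 1, -1, 1, 0, 2, 2, -1, 2]
summary = summarize_labels(example_labels)
print(summary)
```

Passing cluster_labels to the same function would summarize the actual clustering result.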

Finally, we visualize the clusters in a 2D scatter plot, coloring each point according to its assigned cluster label.

import numpy as np
import matplotlib.pyplot as plt

# Plotting the clusters in a two-dimensional scatter diagram
plt.figure(figsize=(10, 7))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=cluster_labels, cmap='Spectral', s=5)

# One colorbar bin per label, from -1 (noise) up to the highest cluster id
max_label = max(cluster_labels)
plt.colorbar(boundaries=np.arange(-1, max_label + 2) - 0.5).set_ticks(range(-1, max_label + 1))

plt.title("Document Clusters Discovered by HDBSCAN")
plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.show()

Two dimensions are convenient for plotting, but the clustering step can afford to keep more structure. The original 384-dimensional embeddings are still too high-dimensional for density-based clustering to work well, so we apply UMAP again, this time reducing to five components:

import umap

# Shrinking the embedding dimensions down to 5, preserving sufficient density information for clustering
reducer = umap.UMAP(n_neighbors=15, n_components=5, min_dist=0.0, random_state=42)
reduced_embeddings = reducer.fit_transform(embeddings)

print(f"Reduced matrix shape: {reduced_embeddings.shape}")

At this point, our numeric embedding vectors corresponding to the news articles are composed of merely five dimensions (features). Let’s verify whether this streamlined representation remains expressive enough to produce meaningful clusters by applying the HDBSCAN algorithm — a density-based clustering technique:

from sklearn.cluster import HDBSCAN

# Setting up HDBSCAN
# min_cluster_size=8: requiring each cluster to contain a minimum of 8 documents
clusterer = HDBSCAN(min_cluster_size=8, min_samples=3, store_centers="centroid")
df['cluster'] = clusterer.fit_predict(reduced_embeddings)

# Tallying the number of entries per cluster
cluster_counts = df['cluster'].value_counts()
print("\nCluster Distribution:")
print(cluster_counts)

Crucial Note: The outcomes of the clustering process are partially shaped by the hyperparameter values we specified for HDBSCAN. I encourage you to experiment with alternative settings for the minimum cluster size and other parameters to see how the results shift.

Output:

Cluster Distribution:
cluster
0    101
1     49
Name: count, dtype: int64

It appears that HDBSCAN identified two clusters corresponding to high-density areas within the data space. Are there also noise points that weren’t assigned to either of these two clusters? Let’s investigate:

for cluster_id in sorted(df['cluster'].unique()):
    if cluster_id == -1:
        print("\n=== CLUSTER: NOISE / UNCLASSIFIED ===")
    else:
        print(f"\n=== CLUSTER: Discovered Topic #{cluster_id} ===")

    # Getting up to 3 sample texts from this cluster
    samples = df[df['cluster'] == cluster_id]['text'].head(3).tolist()
    for i, sample in enumerate(samples, 1):
        clean_sample = " ".join(sample.split())[:120]
        print(f"  {i}. {clean_sample}...")

Output:

=== CLUSTER: Discovered Topic #0 ===
  1. Okay Mr. Dyer, we're properly impressed with your philosophical skills and ability to insult people. You're a wonderful ...
  2. I was at an interesting seminar at work (UK's R.A.L. Space Science Dept.) on this subject, specifically on a small-scale...
  3. This is the second post which seems to be blurring the distinction between real disease caused by Candida albicans and t...

=== CLUSTER: Discovered Topic #1 ===
  1. It's great that all these other cars can out-handle, out-corner, and out- accelerate an Integra. But, you've got to ask ...
  2. l diamond star cars (Talon/Eclipse/Laser) put out 190 hp in the turbo models, and 195 hp in the AWD turbo models, These ...
  3. Sorry for the mis-spelling, but I forgot how to spell it after my series of exams and NO-on hand reference here. Is it s...

No noise group appears in the listing: every sample in the 150-item set was assigned to one of the two detected clusters. Judging from these samples, cluster 1 gathers the automotive posts, while cluster 0 mixes space and medicine posts, so with these settings the automotive theme is readily separable but the two science categories are merged.
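Because the true_label column was kept in the DataFrame, the clusters can also be compared against the original newsgroup categories after the fact. The sketch below shows the idea with a plain contingency count. The labels are made up for illustration and are not results from this run, and the function name is hypothetical.

```python
from collections import Counter


def contingency(cluster_labels, true_labels):
    """Count documents for each (cluster, true category) pair."""
    return Counter(zip(cluster_labels, true_labels))


# Made-up labels for eight documents.
clusters = [0, 0, 0, 0, 0, 1, 1, 1]
categories = ["sci.space", "sci.space", "sci.med", "sci.med", "sci.med",
              "rec.autos", "rec.autos", "rec.autos"]

table = contingency(clusters, categories)
for (cluster, category), count in sorted(table.items()):
    print(f"cluster {cluster} | {category:<9} | {count}")
```

On the real data, pd.crosstab(df['cluster'], df['true_label']) should produce the same kind of table in one call; there, true_label holds integer indices into newsgroups.target_names.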

For further illustration, we can generate cluster plots using the additional code shown below. It produces a scatterplot for each possible pair of the five components that characterize each sample:

import matplotlib.pyplot as plt
import seaborn as sns
import itertools

# Build a DataFrame holding the 5 reduced embeddings and cluster labels
reduced_df = pd.DataFrame(
    reduced_embeddings,
    columns=[f'UMAP_D{i+1}' for i in range(reduced_embeddings.shape[1])],
)
reduced_df['cluster'] = df['cluster']

# Produce all unique pairings of the 5 dimensions
dim_pairs = list(itertools.combinations(reduced_df.columns[:-1], 2))
num_plots = len(dim_pairs)
num_cols = 3
num_rows = (num_plots + num_cols - 1) // num_cols

plt.figure(figsize=(num_cols * 5, num_rows * 4))

for i, (dim1, dim2) in enumerate(dim_pairs):
    plt.subplot(num_rows, num_cols, i + 1)
    sns.scatterplot(
        x=dim1,
        y=dim2,
        hue='cluster',
        data=reduced_df,
        palette='viridis',
        s=70,
        alpha=0.7,
        legend='full'
    )
    plt.title(f'{dim1} vs {dim2}')
    plt.xlabel(dim1)
    plt.ylabel(dim2)
    plt.grid(True, linestyle='--', alpha=0.6)

plt.tight_layout()
plt.show()

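Reference output (figure not reproduced here): pairwise scatterplots of the five UMAP dimensions, colored by cluster.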
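The number of panels and rows in that grid follows from simple counting: five dimensions give ten unordered pairs, and at three panels per row that takes four rows. A minimal sketch of the same arithmetic used in the plotting code, with a hypothetical helper name:

```python
import itertools


def subplot_grid(n_dims, num_cols=3):
    """Return all dimension pairs and the number of rows needed to plot them."""
    pairs = list(itertools.combinations(range(1, n_dims + 1), 2))
    num_rows = (len(pairs) + num_cols - 1) // num_cols  # ceiling division
    return pairs, num_rows


pairs, num_rows = subplot_grid(5)
print(len(pairs), "pairs laid out in", num_rows, "rows")
```

The last row of the grid is only partly filled, which is why the plotting loop runs over the pairs and not over every grid cell.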

When you experiment with various HDBSCAN settings, you might observe that the model reveals a number of clusters that differs from just two. Feel free to explore different options!

Conclusion

Having walked through the steps to build a text-based clustering pipeline, it is useful to highlight the main reasons why combining LLM embeddings with HDBSCAN is beneficial. On one hand, the embeddings produced by sentence-transformers are able to preserve and reflect, to a significant degree, the real semantic meaning and subtle linguistic variations of the source text. On the other hand, HDBSCAN automatically identifies a suitable number of clusters and can flag noisy points that may represent outliers, helping to avoid distortion in aggregated statistics.

Carter
