“Beauty will save the world” — Fyodor Dostoevsky
A. Introduction
Modern AI didn’t appear out of nowhere. Today’s transformer-based tools can seem almost magical, able to grasp context and even the subtle connections between concepts. However, the roots of today’s semantic search technology developed slowly over time. Long before embeddings, transformers, and large language models existed, researchers relied on keyword matching, TF–IDF vectors, and conventional machine learning techniques to process text.
Many of those early approaches never really went away. In reality, current systems still draw on ideas that were created decades ago. The discipline advanced step by step, with each wave of innovation addressing certain challenges while revealing others.
Grasping this progression matters. In machine learning, just as in science overall, understanding our origins often clarifies where we’re headed. The story of semantic search also mirrors a broader transformation in AI: a move from clear, human-built systems to increasingly powerful models whose inner logic is far harder to decipher. Essentially, we’ve shifted from straightforward retrieval rules and hand-tuned features to systems that can learn abstract representations of meaning straight from data.
In this article, we’ll trace that journey through a hands-on example: comparing a student’s art critique with expert critiques of the same painting. Rather than diving straight into embeddings and transformers, we’ll build a series of increasingly refined retrieval systems, weighing both their advantages and their shortcomings.
We’ll walk through four key stages in the development of semantic search:
- Method 1: Handcrafted Retrieval Features + TF–IDF
  A clear, explainable ranking system that blends TF–IDF cosine similarity with understandable features like keyword overlap, critique length normalization, and recency weighting.
- Method 2: Classical Machine Learning for Semantic Ranking
  Pairing TF–IDF feature vectors with supervised learning models such as Logistic Regression to learn ranking patterns from labeled data.
- Method 3: Embedding-Based Semantic Search
  Replacing sparse word-based representations with dense semantic embeddings produced by Sentence Transformers.
- Method 4: Transformer Fine-Tuning
  Fine-tuning pretrained transformer models like BERT to directly capture semantic relationships between critiques.
Figure 1 below illustrates how semantic search methods have evolved.
By the conclusion, we’ll have built increasingly powerful semantic search pipelines. Along the way, we’ll also develop a deeper understanding of how the field itself has transformed — from systems built primarily on human-designed features to models that learn meaning directly from data.
B. Data
To keep our attention on semantic search rather than on building datasets, we’ll work with a small synthetic collection of art critiques. The dataset was carefully crafted to reflect realistic differences in vocabulary, writing style, interpretation, and depth of analysis among critics reviewing the same painting.
Each critique includes both metadata and free-form text. Our goal throughout the article will be to compare a new student’s critique against expert critiques of the same painting and assess semantic similarity using progressively more advanced retrieval techniques.
The structure of each critique is defined using a simple Python dataclass:
```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Critique:
    critique_id: str
    painting_id: str
    critic_name: str
    title: str
    text: str
    published_at: datetime
```

The text field above holds the core critique content used for semantic analysis, while fields like painting_id, critic_name, and published_at supply metadata that can aid filtering, grouping, or ranking experiments.
A sample critique might look like this:
```python
Critique(
    critique_id="c102",
    painting_id="starry_night",
    critic_name="Dr. Elaine Foster",
    title="Emotion Through Motion",
    text="""
    Van Gogh transforms the night sky into a structure that seems alive.
    The swirling brushstrokes generate tension on the soul while the
    exaggerated brightness of the stars creates a dreamlike atmosphere.
    """,
    published_at=datetime(2021, 5, 12)
)
```

Although synthetic, the dataset is detailed enough to illustrate the core principles behind semantic retrieval systems, ranging from basic keyword-based similarity to transformer-driven representations of meaning.
Please note that the code for all four methods is available on GitHub. The exact directory is listed at the end of the article.
C. Methods
C.1 Method 1 — Rule-Based Retrieval and TF–IDF Ranking
We start with one of the most traditional and transparent approaches to semantic search: combining TF–IDF ranking with a small set of handcrafted retrieval features. While modest compared to today’s deep learning systems, this method captures many of the foundational ideas behind document retrieval and similarity scoring. At this stage, the system doesn’t genuinely “understand” language. Instead, it spots patterns in word usage and blends them with manually designed scoring rules.
The backbone of this method is TF–IDF (Term Frequency–Inverse Document Frequency), a well-established technique for turning text into numerical vectors. TF–IDF boosts the weight of words that appear often within a single document but are relatively rare across the entire collection. Common words like “the” or “painting” get very little weight, while more distinctive terms like “composition,” “contrast,” or “symbolism” carry greater influence.
After training the TF–IDF vectorizer on the expert critiques, the system generates a sparse document-term matrix stored in self.matrix. Each row represents a critique, each column corresponds to a learned term or phrase, and the numerical values reflect TF–IDF weights.
Once the critiques are vectorized, cosine similarity can be applied to measure how similar two documents are. Cosine similarity measures the cosine of the angle between two vectors in a high-dimensional space. When two critiques use similar vocabulary in similar proportions, their vectors point in nearly the same direction, which is why they receive higher similarity scores.
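The vectorize-then-compare step described above can be sketched in a few lines with scikit-learn. This is only an illustration under stated assumptions: the three "expert" texts and the query are made-up stand-ins rather than the article's dataset, and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-ins for expert critiques (not the article's dataset).
expert_texts = [
    "Soft lighting and a muted palette give the figure a quiet dignity.",
    "Religious symbolism and historical references dominate the scene.",
    "Thick brushwork and layered color build a textured surface.",
]
query_text = "Gentle lighting and muted color make the figure feel calm."

# Fit on the expert critiques only, then project the query into the same space.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(expert_texts)  # sparse: one row per critique
query_vec = vectorizer.transform([query_text])

# Cosine similarity between the query and every expert critique.
scores = cosine_similarity(query_vec, matrix).ravel()
for text, score in sorted(zip(expert_texts, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {text}")
```

The first text ranks highest because it shares the most distinctive terms with the query, while the second shares none and scores zero. That is exactly the word-level behavior the rest of this section builds on.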
In real-world applications, relying solely on TF–IDF similarity often falls short. Two reviews might express comparable artistic thoughts using entirely different language, while others may seem superficially alike just because they use the same technical terms. To enhance the quality of retrieval, we blend TF–IDF similarity with a set of additional rule-based features.
The heuristic scoring framework incorporates:
- Keyword overlap — evaluates the number of significant terms common to both critiques
- Length normalization — gives preference to critiques offering a reasonable amount of descriptive content without overly rewarding lengthy text
- Recency weighting — slightly prioritizes more recent critiques through exponential time-based decay
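As a rough illustration of the keyword-overlap feature, the sketch below measures the fraction of the query's significant terms that also appear in a document. The exact definition used in the accompanying code may differ; the tiny stop-word list and the function names here are assumptions made for the example.

```python
import re

# Small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "to", "from", "it", "yet", "with"}


def significant_terms(text: str) -> set[str]:
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return {tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOP_WORDS}


def keyword_overlap(query: str, document: str) -> float:
    """Fraction of the query's significant terms that also occur in the document."""
    query_terms = significant_terms(query)
    if not query_terms:
        return 0.0
    return len(query_terms & significant_terms(document)) / len(query_terms)


print(keyword_overlap("gentle lighting and subdued color",
                      "The lighting is gentle, the color restrained."))  # 0.75
```

Three of the four significant query terms ("gentle", "lighting", "color") occur in the document, so the score is 0.75. Dividing by the query's term count keeps the feature between 0 and 1.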
The overall ranking score is calculated as follows:
(Equation 1)
Each component is deliberately scaled between 0 and 1. We still use clipping as a basic safeguard:

```python
np.clip(value, 0.0, 1.0)
```

In our scenario, clipping is effective because the features are inherently bounded. In larger production environments, however, features with broader numerical ranges, such as popularity metrics or citation counts, would generally need proper normalization.
The length normalization component favors critiques that include enough descriptive richness. Assuming a target length of 250 words, the formula is:
(Equation 2)
For instance, a critique containing 125 words earns a score of 0.5. Those with 250 words or more achieve the full score of 1.0.
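The length rule described above is consistent with a simple linear ramp that saturates at the target length. The sketch below assumes that form; the function name and the whitespace-based word count are illustrative choices.

```python
def length_score(text: str, target_words: int = 250) -> float:
    """Linear ramp from 0 to 1 that saturates once the target length is reached."""
    word_count = len(text.split())
    return min(word_count / target_words, 1.0)


print(length_score("word " * 125))  # 0.5
print(length_score("word " * 400))  # 1.0
```

Capping at 1.0 is what prevents very long critiques from being rewarded beyond the target length.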
The recency component introduces a bias toward newer critiques while still keeping older reviews in play:
(Equation 3)
With a half-life set at approximately 10 years:
- A critique posted today gets a score near 1.0
- One written 10 years ago gets around 0.5
- A critique from 20 years ago receives roughly 0.25
This produces a gradual sense of “freshness” akin to approaches traditionally used in search engines and recommendation platforms.
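The three bullet values above match a standard half-life decay, where the score halves for every 10 years of age. The following sketch assumes that form; the function name, the 365.25-day year, and the fixed reference date are illustrative assumptions.

```python
from datetime import datetime


def recency_score(published_at: datetime, now: datetime,
                  half_life_years: float = 10.0) -> float:
    """Exponential decay: the score halves every `half_life_years`."""
    age_years = max((now - published_at).days / 365.25, 0.0)
    return 0.5 ** (age_years / half_life_years)


now = datetime(2025, 1, 1)
for year in (2025, 2015, 2005):
    print(year, round(recency_score(datetime(year, 1, 1), now), 3))
```

With this reference date, the three critiques score 1.0, roughly 0.5, and 0.25, so older critiques fade gradually instead of dropping out of the ranking.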
A major advantage of this method is its transparency. Every aspect of the ranking logic is open and clear. We can easily determine why one critique outranks another by looking at how each feature contributed.
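To make this transparency concrete, here is a minimal sketch that prints each feature's contribution to the final score. It assumes the ranking score is a weighted sum of the clipped features. The weights below are hypothetical and chosen only for illustration; they are not the article's actual values.

```python
import numpy as np

# Hypothetical weights, chosen only for illustration (they sum to 1).
WEIGHTS = {"tfidf": 0.6, "overlap": 0.2, "length": 0.1, "recency": 0.1}


def score_breakdown(features: dict[str, float]) -> dict[str, float]:
    """Return each feature's weighted contribution after clipping to [0, 1]."""
    return {
        name: weight * float(np.clip(features[name], 0.0, 1.0))
        for name, weight in WEIGHTS.items()
    }


# Made-up feature values; "length" is deliberately out of range to show clipping.
example = {"tfidf": 0.42, "overlap": 0.30, "length": 1.3, "recency": 0.8}
breakdown = score_breakdown(example)
for name, part in breakdown.items():
    print(f"{name:8s} {part:.3f}")
print(f"{'total':8s} {sum(breakdown.values()):.3f}")
```

Because the total is just the sum of the printed rows, any ranking decision can be traced back to the individual features that produced it.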
To validate the approach, we build a small synthetic dataset of expert critiques all discussing the same painting. We then input a new student critique and have the system find the most similar expert analyses. The student critique reads:
```python
student_critique_text = """
The painting evokes a calm emotional mood, yet deeply impactful.
The gentle lighting and subdued color scheme
make the main figure appear isolated yet dignified. The background
doesn't distract from the subject; rather, it enhances the feeling
of contemplation and tranquility. Overall, the piece feels personal,
introspective, and thoughtfully arranged.
"""
```

Finally, the program calculates a similarity score between the student critique and each expert critique, presented below in Table 1.
| CRITIQUE TITLE | EXPERT NAME | SCORE |
|---|---|---|
| Light and Stillness | Expert A | 0.531 |
| Psychological Interior | Expert D | 0.297 |
| Narrative and Gesture | Expert E | 0.224 |
| Color and Surface | Expert B | 0.212 |
| Historical Symbolism | Expert C | 0.096 |
The results are logical. The student critique highlighted soft lighting, emotional restraint, and psychological depth. These themes closely mirror the language found in two expert critiques, namely Light and Stillness and Psychological Interior. Critiques centered mainly on symbolism, brushwork technique, or historical context scored lower because they shared fewer lexical and heuristic commonalities.
At the same time, the shortcomings of TF–IDF are already apparent. The approach mainly picks up on surface-level word patterns rather than deeper meaning. For example, expressions like “dramatic use of light” and “strong chiaroscuro effects” may convey very similar artistic concepts while using almost no identical words. Classical retrieval systems frequently face challenges in these scenarios because they rely primarily on word-level matching.
These shortcomings drive the next phase in semantic search development: machine learning models that learn to rank results from data instead of depending mainly on hand-crafted scoring formulas.
C.2 Method 2 – Classical Machine Learning with TF-IDF Features
The next advance in semantic search replaces hand-designed scoring formulas with supervised machine learning. Rather than manually setting how much weight to give TF-IDF similarity, keyword overlap, or other rule-based features, we let a model discover useful patterns straight from labeled training data.
For this approach, we work with a different set of painting critiques than the one used in the earlier method. In this dataset, certain critiques are tagged as “expert-like,” while others are marked as more beginner-level analyses. Instead of ordering critiques by similarity, the objective here is to build a classifier capable of predicting whether a critique mirrors expert-level analysis.
As in the previous method, the first step is TF-IDF vectorization. Every critique is turned into a high-dimensional numerical vector where the values reflect how important words and phrases are within the document set. But rather than measuring vectors against each other with cosine similarity, we pass these TF-IDF features into a supervised learning algorithm like Logistic Regression.
Logistic Regression is a well-established machine learning technique for classification. Rather than following manually coded rules, the model picks up patterns from training examples. It figures out which words and writing styles appear more frequently in expert critiques and then applies those patterns to assess new critiques on its own. This marks a key transition because the system now draws insights from data instead of leaning on manually built rules.
The code snippet below illustrates the pipeline built from the TfidfVectorizer and Logistic Regression.
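```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# TF-IDF features (unigrams and bigrams) feed a Logistic Regression classifier
model = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),
        lowercase=True,
        min_df=1,
        stop_words="english"
    )),
    ("classifier", LogisticRegression())
])
```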
Once trained, the model can examine a fresh student critique and generate both:
- a predicted class label
- a probability score showing how likely the critique is to be expert-like
A probability near 1 signals strong resemblance to expert critiques, while a probability close to 0 points to more novice-level writing. By default, probabilities at or above 0.5 receive label 1 (“expert-like”), and probabilities below 0.5 receive label 0. Our new critique was assigned a label of 1 with a probability of 0.672.
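The sketch below shows how the label and the probability can be read from a fitted pipeline. The four training texts are made-up stand-ins for the labeled critiques, so the printed probability will not match the 0.672 reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative training set (1 = expert-like, 0 = novice-like); not the article's data.
texts = [
    "The placement of shadow creates psychological depth and tension.",
    "Contrast and spatial composition give the scene emotional intensity.",
    "I think the painting is beautiful and the colors are nice.",
    "The artist wanted to show a pretty scene and I like it.",
]
labels = [1, 1, 0, 0]

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("classifier", LogisticRegression()),
])
model.fit(texts, labels)

new_critique = ["Shadow and contrast add depth to the composition."]
label = model.predict(new_critique)[0]
# Columns of predict_proba follow model.classes_, which is [0, 1] here.
prob_expert = model.predict_proba(new_critique)[0, 1]
print(label, round(float(prob_expert), 3))
```

For a binary Logistic Regression, `predict` returns class 1 exactly when the class-1 probability exceeds 0.5, which is the default threshold described above.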
One of the most appealing qualities of Logistic Regression is interpretability. Since the model assigns numerical weights to each TF-IDF feature, we can directly examine which words and phrases shape the classification outcomes.
In this test, the classifier placed higher importance on terms like “placement,” “emotional,” “depth,” “psychological,” “intensity,” and “shadow.” When reading through the critiques, this result seems logical because these words tend to show up in expert-like critiques that explore structure, symbolism, interpretation, or spatial composition in greater detail. On the other hand, phrases such as “beautiful,” “artist wanted,” and “think” were given lower weights. These phrases appear more often in novice-like critiques, which center on broad impressions rather than in-depth analysis. After training, we can review the learned weights and identify which words drove the predictions.
| Feature | Logistic Regression Coefficient |
|---|---|
| emotional | 0.150719 |
| placement | 0.148277 |
| depth | 0.146912 |
| contrast | 0.146912 |
At the same time, we should be cautious about overstating what the model actually does. The model is not truly interpreting the artwork or grasping its symbolism the way a human expert would. It is merely spotting patterns in the language found in the critiques. If experts regularly use terms like “depth” and “psychological tension,” the model picks up on the fact that these patterns align with expert-level writing.
This drawback becomes clearer when two critiques convey the same ideas using entirely different wording. Logistic Regression performs best when similar ideas are phrased with similar vocabulary. If the word choice shifts too much, the model may fail to see the connection between the critiques. This issue pushed researchers toward embedding-based approaches that aim to capture meaning rather than just matching words.
C.3 Method 3 – Embedding-Based Semantic Search
The next major leap in semantic search moves past TF-IDF and basic word counting. Rather than encoding text as word frequencies, modern systems rely on dense semantic embeddings produced by transformer-based language models.
This is the point where the system begins moving past surface-level vocabulary and starts grasping actual meaning. Two critiques may use very different wording to express an artistic concept, and yet they are still identified as similar.
To produce the embeddings, we use a Sentence Transformer model from the Hugging Face ecosystem. Sentence Transformers convert entire sentences or documents into dense numerical vectors. These vectors are built to capture the meaning of the text and the connections between different pieces of writing.
For instance, phrases like:
“dramatic use of light”
“careful illumination”
“strong chiaroscuro effects”
appear quite different on the surface, but they communicate closely related artistic ideas. Unlike TF-IDF, embedding models can frequently detect these semantic links. Unlike the Logistic Regression model from Method 2, the embedding model does not attach explicit weights to individual words like “contrast” or “psychological.” Instead, semantic information gets spread across many dimensions of the embedding space. This makes the representations harder to interpret at a glance, but also far more flexible in capturing meaning.
For Method 3, we introduce a fresh set of critiques aimed at finding semantic similarity at a deeper level. Some critiques employ highly technical vocabulary, while others express similar artistic ideas in a more conversational or roundabout manner. This creates a tougher retrieval challenge because critiques may discuss related concepts without sharing many of the same keywords.
After producing embeddings for all critiques, we calculate cosine similarity directly within the embedding space. Each critique embedding generated by the Sentence Transformer is a dense numerical vector with 384 dimensions, one for each learned feature.
Similarity is calculated in two ways: (a) between every student critique and every expert critique, and (b) between each student critique and an expert centroid (Table 2). This centroid vector is derived by averaging the corresponding components of all expert critique embeddings. As a result, the centroid also has 384 dimensions. In essence, this centroid captures the approximate semantic “center” of expert-level critiques and serves as a benchmark for measuring how closely a student critique aligns with expert writing within the embedding space.
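The centroid comparison can be sketched with plain NumPy. To keep the example self-contained, the vectors below are made-up 4-dimensional stand-ins for the real 384-dimensional embeddings. In practice they would come from the Sentence Transformer's `encode` method.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up 4-dimensional vectors standing in for 384-dimensional embeddings.
expert_embeddings = np.array([
    [0.9, 0.1, 0.3, 0.0],
    [0.8, 0.2, 0.4, 0.1],
    [0.7, 0.0, 0.5, 0.2],
])
student_embedding = np.array([0.6, 0.3, 0.2, 0.1])

# Component-wise mean of the expert vectors; same dimensionality as each embedding.
centroid = expert_embeddings.mean(axis=0)

print("vs. each expert:", [round(cosine(student_embedding, e), 3) for e in expert_embeddings])
print("vs. centroid:   ", round(cosine(student_embedding, centroid), 3))
```

The pairwise scores answer "which expert is this critique closest to?", while the single centroid score summarizes how expert-like the critique is overall.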
| STUDENT CRITIQUE NAME AND TITLE | EXPERT CENTROID-LIKENESS SCORE |
| S1-Drama Through Light and Response | 0.802 |
| S4-Emotional Response | 0.618 |
| S5-Formal Analysis Attempt | 0.765 |
| S6-General Impression | 0.75 |
| S7-Symbolic Interpretation | 0.73 |
To better interpret the embedding space, we also visualize the embeddings using PCA (Figure 2). PCA projects the 384-dimensional embeddings onto the two directions of greatest variance, so the plot keeps much of the overall structure, although some semantic detail is inevitably lost in the compression.
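As a rough sketch of the projection step, the code below reduces a set of 384-dimensional vectors to two dimensions with scikit-learn's PCA. Random vectors stand in for the real critique embeddings, so only the shapes and the API usage are meaningful here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Random vectors standing in for eight 384-dimensional critique embeddings.
embeddings = rng.normal(size=(8, 384))

pca = PCA(n_components=2)
points_2d = pca.fit_transform(embeddings)  # shape (8, 2), ready for a scatter plot

print(points_2d.shape)
# Share of the total variance that survives the projection to two dimensions.
print(round(float(pca.explained_variance_ratio_.sum()), 3))
```

The explained-variance ratio is worth checking on real embeddings: when it is low, distances in the 2-D plot should be read as a rough guide rather than an exact picture of the embedding space.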

The PCA visualization uncovers several noteworthy patterns. Student Critique S1 is positioned near Expert Critiques E1 and E2. This is logical, as they all explore related themes such as light, shadow, mood, and dramatic significance.
Student Critique S7 is similarly positioned close to Expert Critique E3. Both critiques address symbolism, emotion, and the deeper meaning conveyed by the painting. Despite using different terminology, they convey comparable ideas.
The PCA visualization also reveals that student and expert critiques do not form completely distinct clusters. Some student critiques are surprisingly near expert critiques, particularly when they address similar artistic themes. Conversely, weaker or more generic critiques tend to be located farther from the expert region within the embedding space.
The Expert-Likeness Scores (Table 2) are consistent with the PCA visualization. S1 achieves the highest score (0.802) and is located near expert critiques E1 and E2, indicating that S1 most closely resembles the expert critiques. S5 (0.765) and S6 (0.75) also earn relatively high scores. In the visualization, they are positioned near each other and moderately close to the expert critiques.
S7 receives a moderate score (0.73) yet is located very close to E3, as both critiques focus on symbolism, emotion, and deeper meaning. S4 obtains the lowest score (0.618) and is also positioned farther from the expert critiques in the visualization. This critique emphasizes personal feelings rather than detailed artistic analysis.
At this point, despite the shift from basic keyword matching to meaning-based analysis, the embedding model itself remains fixed: it is used as pretrained and never adapts to our critiques. The next phase fine-tunes a transformer directly on labeled critiques so that the model adapts to the task.
C.4 Method 4 – Fine-Tuned Transformer Models
The final phase introduces fine-tuned transformer models. In Method 3, a Sentence Transformer was used to compare critiques based on semantic similarity. Here, we take it a step further by training the model directly on labeled expert and novice critiques.
Specifically, we fine-tune a pretrained DistilBERT model from the Hugging Face Transformers library. DistilBERT is a compact and efficient variant of BERT. It was designed to capture many of the same language patterns as the original BERT model while requiring fewer parameters. DistilBERT was developed through a technique called knowledge distillation. Despite being lighter and more efficient to train, it still delivers strong performance across many NLP tasks.
In our Method 4, rather than learning language from scratch, the model (DistilBERT) begins with knowledge acquired from vast amounts of text and then adapts to our critique-classification task. This approach is known as transfer learning. Transformers also leverage attention mechanisms that enable the model to capture relationships between words within a sentence.
The training pipeline consists of:
- tokenizing critiques into transformer-compatible inputs
- fine-tuning the pretrained model on labeled critiques
- generating class probabilities for each critique
Let us examine the code snippet from Method 4, shown below.
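```python
from transformers import AutoTokenizer

# Load the tokenizer that matches the pretrained checkpoint
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Tokenize one example: truncate or pad every critique to exactly 128 tokens
def tokenize_function(example):
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

# `dataset` is assumed to be a Hugging Face `datasets.Dataset` with a "text" column
tokenized_dataset = dataset.map(tokenize_function)
```

The tokenizer created with AutoTokenizer.from_pretrained() is used within tokenize_function() through the line tokenizer(example["text"], ...).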
In transformer-based NLP, the tokenizer does far more than split text into tokens. It carries out several preprocessing steps at once:
- breaks the text into tokens
- converts the tokens into numerical token IDs using the model’s vocabulary
- adds special transformer tokens such as [CLS] and [SEP]
- truncates long sequences
- pads shorter sequences to a fixed length
- creates attention masks

The resulting numerical representation is what the transformer model subsequently uses as input for training and prediction.
The argument truncation=True ensures that overly long critiques are trimmed to a maximum length. The argument padding="max_length" pads shorter critiques with zeros so that all input sequences share the same fixed length (128 tokens). Finally, dataset.map(tokenize_function) applies this tokenization process to every example in the dataset, producing a transformer-ready dataset for training.
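To see what truncation, padding, and the attention mask look like in practice, here is a deliberately simplified imitation in pure Python. It is not the Hugging Face tokenizer: the seven-entry vocabulary, the whitespace splitting, and the token IDs are all made up for illustration, and real tokenizers use learned subword vocabularies.

```python
# Toy vocabulary; real tokenizers use a learned subword vocabulary with different IDs.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "soft": 4, "light": 5, "calm": 6}


def encode(text: str, max_length: int = 8) -> dict[str, list[int]]:
    """Mimic truncation=True and padding="max_length" on whitespace tokens."""
    # Truncate, leaving room for the two special tokens.
    tokens = text.lower().split()[: max_length - 2]
    ids = [VOCAB["[CLS]"]] + [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens] + [VOCAB["[SEP]"]]
    mask = [1] * len(ids)  # 1 marks a real token
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [VOCAB["[PAD]"]] * pad,
        "attention_mask": mask + [0] * pad,  # 0 marks padding the model should ignore
    }


print(encode("Soft light feels calm"))
# {'input_ids': [2, 4, 5, 1, 6, 3, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0]}
```

The attention mask is what lets the model ignore the padding positions, so every sequence can share one fixed length without the padding influencing the prediction.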
Unlike the embedding-based approach of Method 3, this method performs explicit supervised classification. For each critique, the model predicts:
- a class label
- a confidence score for that label
- a probability for each class
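These three outputs are related in a simple way. A classification head typically emits one raw score (logit) per class, a softmax turns the logits into probabilities, the label is the most probable class, and the confidence is that class's probability. The sketch below assumes this standard setup; the logit values are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.9, 0.1]  # hypothetical raw scores for [novice-like, expert-like]
probs = softmax(logits)
label = probs.index(max(probs))  # index of the most probable class
confidence = probs[label]        # probability assigned to the predicted class

print(label, round(confidence, 3), [round(p, 3) for p in probs])
```

With two classes, the confidence always equals the larger of the two probabilities, which is why the reported confidence and one of the class probabilities coincide in the model's output.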
To illustrate, let us look at this sample critique:
“The positioning of the figures and the deliberate application of shadow generate psychological tension and symbolic ambiguity across the entire composition.”
On the surface, this critique appears fairly refined because it employs specialized artistic terminology, including:
- “psychological tension”
- “symbolic ambiguity”
- “composition”
A more basic technique like TF–IDF would likely assign high scores to these terms since they commonly appear in expert critiques. Essentially, TF–IDF simply detects that the critique includes significant vocabulary linked to art analysis.
In contrast, the transformer model goes beyond individual keywords. It examines how concepts relate to one another throughout the sentence and whether the critique demonstrates deeper reasoning. While the critique uses impressive terminology, the analysis remains brief and somewhat vague. It mentions psychological tension and symbolism but fails to elaborate on them in any meaningful depth. When measured against expert critiques, its reasoning is noticeably less thorough.
Following 100 training epochs, the transformer accurately categorized the critique as novice-level:
```
Predicted label: 0
Confidence: 0.685
Probability novice-like: 0.685
Probability expert-like: 0.315
```

Notably, when the model was trained for only 30 epochs, the same critique was flagged as expert-like. This suggests that in earlier stages of training, the model most likely leaned heavily on impressive vocabulary. Further training enabled it to prioritize wider contextual and analytical patterns over surface-level keywords alone.
A key challenge with transformer fine-tuning is that these models typically demand substantial quantities of training data. Our educational dataset consists of only a limited number of critiques. Given that transformer architectures contain millions of trainable parameters, they commonly require significantly larger datasets to achieve reliable generalization.
With prolonged training across many epochs, the model steadily grows more confident in its predictions. However, when working with a limited dataset, a portion of this confidence could reflect rote memorization of stylistic patterns encountered during training instead of authentic language comprehension. This issue is referred to as overfitting and is particularly prevalent when large transformer models are trained on restricted data.
This scenario underscores both the advantages and drawbacks of transformer models. They are capable of grasping meaning that goes beyond basic keyword matching, but they can also grow excessively confident when training data is limited.
This concluding stage wraps up the journey spanning:
- clear heuristic evaluation
- traditional machine learning
- semantic embeddings
- context-driven language comprehension using transformers
Combined, these four approaches reflect the wider progression of semantic search and modern NLP: a transition from handcrafted features toward progressively more advanced learned representations of meaning and context.
D. Discussion
The four approaches presented in this piece trace how keyword matching has transformed into contextual language comprehension.
The first approach, TF-IDF combined with rule-based scoring, was straightforward and easy to interpret. We could readily identify why one critique outranked another. Still, this technique relied heavily on exact word choice and often overlooked the underlying meaning.
The second approach applied Logistic Regression to TF-IDF features. Instead of hand-coding rules, the system identified patterns from labeled critiques. By inspecting the learned coefficients, we can identify which words appear more frequently in expert critiques versus novice ones. Logistic Regression detects these patterns from the TF-IDF word vectors. As noted earlier, the system does not genuinely comprehend context or meaning. Despite this, it can still deliver surprisingly strong performance when specific words or phrases strongly align with certain writing styles.
The third approach brought embeddings through Sentence Transformers. This marked a significant leap because critiques could now be compared based on conceptual meaning rather than literal wording. Critiques addressing similar artistic concepts frequently appeared near each other in embedding space, even when expressed differently.
A key finding from Method 3 was that critique quality is not strictly binary. Some student critiques turned out to be semantically close to expert critiques despite retaining a novice-level label. In this approach, the Sentence Transformer serves primarily as a pretrained semantic embedding model. The transformer itself is not retrained. Instead, every critique is transformed into a dense semantic vector, and similarity is assessed using cosine similarity within embedding space.
Lastly, in Method 4, we introduced the fine-tuned transformer model. This model incorporated contextual language comprehension via DistilBERT. Both Method 2 and Method 4 follow supervised learning paradigms since they learn from labeled examples. Yet they learn in fundamentally different ways. Logistic Regression functions on fixed TF-IDF features, derived from word and phrase frequencies. Meanwhile, transformers acquire contextual representations by examining word relationships, sentence construction, and overall meaning.
A critical distinction is that although both Method 3 and Method 4 leverage transformer architectures, they apply them differently. In Method 3, the transformer is employed mainly as a pretrained embedding generator for semantic similarity. In Method 4, the transformer is fine-tuned directly on the labeled critique dataset. During training, the model modifies its internal parameters to learn how to distinguish expert critiques from novice ones. Rather than simply functioning as a feature extractor, the transformer itself becomes the classifier. This marks a crucial conceptual leap from semantic similarity matching to supervised task-specific learning.
The findings also revealed one of the primary challenges of transformer fine-tuning: the fact that large models typically demand substantially more training data. When only a small dataset is available, the model may memorize training examples too precisely and struggle to generalize effectively to unseen data.
Overall, we explored the various approaches in a stepwise fashion, illustrating how different NLP models represent meaning in distinct ways. In particular, TF-IDF centers mainly on prominent words, embedding models emphasize semantic similarity, and transformers attempt to comprehend language through context and the relationships between words.
E. Conclusion
In this article, we examined four hands-on approaches to semantic search, progressing from traditional TF-IDF retrieval to contemporary transformer models. Through the example of student and expert painting critiques, we explored how different NLP techniques represent language and assess similarity.
The findings revealed that each approach offers distinct strengths and limitations. Traditional methods remain simple, quick, and transparent. Semantic embedding models effectively capture conceptual similarity even with modest datasets. Transformers deliver deeper contextual comprehension but usually need more labeled data to achieve reliable generalization.
A pivotal insight was that conceptual understanding exists along a spectrum. Some student critiques resembled expert critiques closely despite not reaching expert quality.
Contemporary NLP systems are growing more adept at grasping meaning, context, and interconnections between ideas. Yet, the core objective stays unchanged: enabling machines to better comprehend human language.
The source code for the approaches outlined above is available at:
The synthetic data (critiques) is embedded within the code.
Note: All figures and plots were generated by the author.
Thank you for reading!



