# Introduction
Natural language processing (NLP) has clearly shifted in recent years, with large language models (LLMs) and transformers taking on complex end-to-end understanding tasks. Yet in any real-world NLP workflow, raw text still needs to be tokenized, normalized, and analyzed before it ever reaches a model. While modern NLP libraries and ecosystems like spaCy or Hugging Face are excellent for building general-purpose deep learning pipelines or integrating with LLMs, the Natural Language Toolkit (NLTK) remains a practical, transparent choice for detailed structural linguistics, custom text normalization, and statistical corpus analysis.
Unfortunately, many developers mistakenly assume that LLMs make traditional text preprocessing unnecessary, or they write preprocessing code using naive methods that throw away important linguistic structure. They break apart multi-word expressions like “machine learning” into separate, meaningless words; they apply context-blind lemmatization that produces incorrect base forms; or they depend on simple raw frequency counts that overlook meaningful word associations.
To build robust, semantically accurate NLP models, you need to preserve structural and linguistic context during the preprocessing stage. In this article, we will walk through three essential NLTK techniques to level up your text preprocessing:
- preserving phrase integrity with the MWETokenizer
- context-aware lemmatization with Part-of-Speech (POS) mapping
- statistical collocation extraction using association measures
# 1. Preserving Domain Terminology with the Multi-Word Expression Tokenizer
Tokenization is the foundation of any NLP pipeline. However, standard tokenizers split sentences strictly by whitespace and punctuation. This becomes problematic when dealing with domain-specific multi-word expressions — such as "neural network", "decision tree", or "San Francisco" — where the individual words combine to form a single semantic concept.
If a tokenizer splits "neural network" into "neural" and "network", a downstream vectorizer (like Bag-of-Words or TF-IDF) will treat them as unrelated features, diluting the signal and introducing noise. Developers often try to fix this by writing search-and-replace regular expressions on the raw text before tokenizing.
Using character-level replacements (e.g. text.replace("neural network", "neural_network")) is brittle. It fails to respect word boundaries, handles punctuation poorly, and needs one pass over the text per term, so the cost grows with the size of the term list. A cleaner approach is to tokenize the text first and then run NLTK’s native MWETokenizer to merge the matching tokens.
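To make the word-boundary problem concrete, here is a small sketch. The sentence and the term "test set" are made up for illustration; the point is only that a plain string replacement can fire inside unrelated words, while a regex anchored with \b does not.

```python
import re

text = "Check the latest settings before scoring the test set."

# Character-level replacement: also matches inside "latest settings"
naive = text.replace("test set", "test_set")

# Regex with \b anchors: only matches where "test" starts a word
bounded = re.sub(r"\btest set\b", "test_set", text)

print(naive)    # Check the latest_settings before scoring the test_set.
print(bounded)  # Check the latest settings before scoring the test_set.
```

Anchoring fixes this particular bug, but each additional term still means another substitution over the raw string.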
The naive approach below applies regex substitutions to the raw string. Even with word-boundary anchors (\b), it needs a separate pass over the text for every term and leaves punctuation attached to the resulting tokens:

    import re

    # Sample corpus
    raw_texts = [
        "We are studying neural networks and deep learning.",
        "The decision tree is a popular model in machine learning.",
        "A neural network can have many layers."
    ] * 5000

    cleaned_texts = []
    for text in raw_texts:
        # Manual string replacements for domain terms; \b anchors each match at a word boundary
        text = re.sub(r"\bneural networks?\b", "neural_network", text, flags=re.IGNORECASE)
        text = re.sub(r"\bdecision trees?\b", "decision_tree", text, flags=re.IGNORECASE)
        text = re.sub(r"\bmachine learning\b", "machine_learning", text, flags=re.IGNORECASE)
        # Tokenize the processed string on whitespace
        tokens = text.lower().split()
        cleaned_texts.append(tokens)

    print("Sample tokens:", cleaned_texts[0])

Output:

    Sample tokens: ['we', 'are', 'studying', 'neural_network', 'and', 'deep', 'learning.']

Now let’s try using NLTK’s tokenizers. We first tokenize using the standard word_tokenize method and then pass the token stream through an initialized MWETokenizer, which merges registered expressions on token boundaries:

    import nltk
    from nltk.tokenize import word_tokenize, MWETokenizer

    # Ensure NLTK tokenizer resources are downloaded
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)  # newer NLTK releases look for this resource instead

    raw_texts = [
        "We are studying neural networks and deep learning.",
        "The decision tree is a popular model in machine learning.",
        "A neural network can have many layers."
    ] * 5000

    # Initialize tokenizer and register MWE tuples
    mwe_tokenizer = MWETokenizer([
        ('neural', 'network'),
        ('neural', 'networks'),
        ('decision', 'tree'),
        ('decision', 'trees'),
        ('machine', 'learning')
    ], separator="_")

    cleaned_texts_mwe = []
    for text in raw_texts:
        # Tokenize words using NLTK's standard tokenizer
        tokens = word_tokenize(text.lower())
        # Merge specified multi-word expressions
        merged_tokens = mwe_tokenizer.tokenize(tokens)
        cleaned_texts_mwe.append(merged_tokens)

    print("Sample tokens:", cleaned_texts_mwe[0])

The result is close to the regex version, with two differences: the registered plural is kept as neural_networks, and the trailing period becomes its own token instead of staying attached to "learning":

    Sample tokens: ['we', 'are', 'studying', 'neural_networks', 'and', 'deep', 'learning', '.']

Using the MWETokenizer shifts the operation from character-level string matching to token-level comparison.
- We define the multi-word expressions as tuples of independent tokens: ('neural', 'network').
- By setting separator="_", the tokenizer merges the matching sequence into a single string token: "neural_network".
- Because it acts directly on token arrays, it is immune to boundary matching bugs and handles trailing punctuation correctly: "neural networks." is first split into "neural", "networks", "." and then safely merged into "neural_networks", ".".
- Registered expressions are stored in a trie, so the tokenizer makes a single pass over the tokens and scales cleanly to hundreds of domain terms.
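When the term list lives outside the code, expressions can also be registered one at a time with add_mwe. The sketch below assumes a hypothetical domain_terms list and uses a small regex tokenizer (simple_tokenize, not part of NLTK) as a stand-in for word_tokenize so that it runs without downloading any NLTK data.

```python
import re
from nltk.tokenize import MWETokenizer

# Hypothetical term list; in practice this might be loaded from a glossary file
domain_terms = ["neural network", "neural networks", "machine learning"]

mwe_tokenizer = MWETokenizer(separator="_")
for term in domain_terms:
    # add_mwe expects the expression as a sequence of tokens
    mwe_tokenizer.add_mwe(tuple(term.split()))

def simple_tokenize(text):
    """Stand-in for word_tokenize: words and punctuation become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("Neural networks power machine learning. A neural network learns.")
merged = mwe_tokenizer.tokenize(tokens)
print(merged)
# ['neural_networks', 'power', 'machine_learning', '.', 'a', 'neural_network', 'learns', '.']
```

Note that when both "neural network" and "neural networks" are registered, the tokenizer merges whichever form actually appears in the token stream.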
# 2. Context-Aware Lemmatization with POS-Tag Mapping
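As the summary at the end of this article puts it, context-aware lemmatization pairs POS tag generation with WordNet lookups. A minimal sketch of the mapping step follows, assuming Penn Treebank tags of the kind nltk.pos_tag produces. The helper name penn_to_wordnet and the hand-written tags are illustrative assumptions, not output from a real tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS codes WordNet uses."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback for every other tag

# Hand-tagged tokens for illustration only
tagged = [("the", "DT"), ("leaves", "NNS"), ("are", "VBP"),
          ("falling", "VBG"), ("quickly", "RB"), ("green", "JJ")]

mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
# [('the', 'n'), ('leaves', 'n'), ('are', 'v'), ('falling', 'v'), ('quickly', 'r'), ('green', 'a')]
```

Each mapped code would then be passed as the pos argument of WordNetLemmatizer.lemmatize(word, pos=...), which otherwise treats every word as a noun.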
# 3. Statistical Phrase Extraction using Collocation Finders
Identifying key phrases or multi-word expressions within text is crucial for tasks like topic discovery, search optimization, and sentiment detection. These multi-word units, known as collocations, represent combinations of words that appear together more frequently than random probability would suggest.
A straightforward method for spotting collocations involves tallying all raw bigrams (pairs of consecutive words) and ranking them by how often they appear. Unfortunately, this basic strategy produces mostly unhelpful results. Because of natural language frequency patterns, generic pairs like “of the”, “in the”, and “on a” will inevitably dominate the top spots. Even when you remove common stopwords, simple frequency counts can still elevate arbitrary, coincidental word pairs that happen to recur a handful of times.
The refined approach leverages NLTK’s BigramCollocationFinder along with statistical association measures. Rather than relying on raw occurrence counts, you apply metrics such as Pointwise Mutual Information (PMI) or the Chi-Square test. These techniques assess whether two words co-occur significantly more than pure randomness would predict.
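To see what PMI measures, it helps to compute it by hand once. The sketch below uses the usual definition, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), with every probability estimated as a count divided by the total number of tokens. The toy sentence and the function name pmi are illustrative assumptions.

```python
import math
from collections import Counter

tokens = ("the rise of the field of machine learning and "
          "the cost of the tools of machine learning").split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def pmi(w1, w2):
    """log2 of observed co-occurrence over what independence would predict."""
    observed = bigram_counts[(w1, w2)] * total
    expected = word_counts[w1] * word_counts[w2]
    return math.log2(observed / expected)

# Both bigrams occur exactly twice, yet their PMI differs sharply
print(bigram_counts[("of", "the")], round(pmi("of", "the"), 3))
print(bigram_counts[("machine", "learning")], round(pmi("machine", "learning"), 3))
```

"of the" and "machine learning" have the same raw count here, but "of" and "the" each appear in many other contexts, so their pairing is much less surprising and scores lower.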
Below, the basic method simply tallies raw bigrams and picks the highest-frequency pairs, capturing plenty of noise and filler words:

    from collections import Counter
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.util import bigrams

    # Ensure NLTK tokenizer resources are downloaded
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)  # newer NLTK releases look for this resource instead

    # Sample corpus
    corpus = """
    Natural language processing is an active field of AI. Machine learning plays a key role
    in natural language processing. Deep learning architectures have revolutionized natural
    language processing. We need machine learning models to solve these natural language tasks.
    """
    tokens = word_tokenize(corpus.lower())

    # Extract and count raw bigrams
    raw_bigrams = list(bigrams(tokens))
    bigram_counts = Counter(raw_bigrams)

    print("Top 5 Raw Bigrams:")
    for bigram, freq in bigram_counts.most_common(5):
        print(f"{bigram}: {freq}")

Output:

    Top 5 Raw Bigrams:
    ('natural', 'language'): 4
    ('language', 'processing'): 3
    ('machine', 'learning'): 2
    ('processing', '.'): 2
    ('processing', 'is'): 1

Next, we set up NLTK's collocation finder, apply filtering conditions, and use the BigramAssocMeasures class to score phrase associations with Pointwise Mutual Information (PMI):

    import nltk
    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics.association import BigramAssocMeasures
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)

    corpus = """
    Natural language processing is an active field of AI. Machine learning plays a key role
    in natural language processing. Deep learning architectures have revolutionized natural
    language processing. We need machine learning models to solve these natural language tasks.
    """
    tokens = word_tokenize(corpus.lower())

    # Set up the collocation discovery mechanism
    finder = BigramCollocationFinder.from_words(tokens)

    # Remove punctuation and stop words from consideration
    stop_words = set(stopwords.words('english'))

    def filter_stops(w):
        return w in stop_words or not w.isalnum()

    finder.apply_word_filter(filter_stops)

    # Discard bigrams appearing fewer than N times
    finder.apply_freq_filter(2)

    # Evaluate bigrams using pointwise mutual information
    pmi_measures = BigramAssocMeasures()
    top_collocations = finder.score_ngrams(pmi_measures.pmi)

    print("Top Collocations by PMI:")
    for bigram, pmi_score in top_collocations[:5]:
        # Build a readable output representation
        phrase = " ".join(bigram)
        print(f"Phrase: {phrase:<30} | PMI Score: {pmi_score:.4f}")

Output:

    Top Collocations by PMI:
    Phrase: machine learning               | PMI Score: 3.8074
    Phrase: language processing            | PMI Score: 3.3923
    Phrase: natural language               | PMI Score: 3.3923

- BigramCollocationFinder.from_words() pulls out all two-word combinations while keeping track of the word and bigram frequencies needed for scoring.
- We refine the candidate list using finder.apply_word_filter(), which screens out bigrams that include stop words or punctuation. Because the filter runs after the bigrams have been counted, removing a word never causes two non-adjacent words to be paired.
- With apply_freq_filter(2), we rule out one-off random pairings, cutting down on statistical noise.
- Lastly, scoring via pointwise mutual information quantifies how much more often two words co-occur than their individual frequencies would predict. This surfaces tightly coupled terms like "machine learning" and "natural language" while filtering out common but loosely associated pairings.
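The frequency filter matters more than it may seem, because PMI is at its highest for two words that each occur only once and happen to sit next to each other. The sketch below compares rankings with and without the filter. The token list is a shortened, pre-tokenized variant of the sample corpus, and the helper score_by_pmi is a hypothetical name introduced for this illustration.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics.association import BigramAssocMeasures

tokens = ("natural language processing is an active field . "
          "machine learning plays a key role in natural language processing . "
          "we need machine learning models for natural language tasks .").split()

def score_by_pmi(min_freq):
    """Rank bigrams by PMI, keeping only those seen at least min_freq times."""
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_word_filter(lambda w: not w.isalnum())  # drop punctuation tokens
    finder.apply_freq_filter(min_freq)
    return finder.score_ngrams(BigramAssocMeasures().pmi)

unfiltered = score_by_pmi(1)  # every bigram is a candidate
filtered = score_by_pmi(2)    # one-off pairings are discarded

print("Top unfiltered:", unfiltered[0])
print("Top filtered:  ", filtered[0])
```

Without the filter, the top of the ranking is taken by a pair of words that appear only once in the text; with it, "machine learning" moves to first place.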
# Wrapping Up
Tailored text preprocessing is essential for distilling cleaner signals from unstructured text, and NLTK supplies the structural building blocks needed to customize these operations.
By weaving these three NLTK methods into your workflow, you can construct far more resilient NLP pipelines:
- Safeguarding domain-specific terminology: MWETokenizer fuses multi-word expressions at the token level, so critical concepts remain intact during vectorization
- Context-sensitive lemmatization pairs POS tag generation with WordNet lookups to recover linguistically precise root forms, substantially shrinking vocabulary size
- Statistical collocation extraction harnesses mathematical association measures like PMI to pinpoint genuine semantic phrases from raw text, sidestepping the distortions of simple frequency-based counts
Embedding these structural strategies into your feature engineering pipeline helps ensure that downstream classification, search, and clustering models receive high-fidelity, semantically coherent tokens.
Matthew Mayo (@mattmayo13) holds a master's degree in computer science and a graduate diploma in data mining. As managing editor of KDnuggets & Statology, and contributing editor at Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.



