# Classical NLP Approaches for Author Identification: Lessons from the Spooky Author Classification Challenge
Author identification is a good way to test NLP models because it focuses not only on *what* a sentence says, but also on *how* it is written. Kaggle’s **Spooky Author Identification** competition is a compact version of this challenge: given a single sentence from gothic or horror fiction, the model has to predict whether it was written by **Edgar Allan Poe (EAP)**, **Mary Wollstonecraft Shelley (MWS)**, or **H. P. Lovecraft (HPL)**.
At first, this seems like a typical three-class text classification problem. But in reality, it is more complex. The authors all write about similar themes: fear, mystery, death, atmosphere, and the supernatural. Simple keywords are not enough to tell them apart. Instead, the important clues are often stylistic: function words, punctuation, character patterns, short phrases, sentence rhythm, and the way each author builds a sentence.
This made the project a good way to explore a specific question:
> ***How far can classical NLP go when we choose representations carefully and evaluate them honestly?***
I approached the task by building a sequence of increasingly capable classical models:
1. a fast Vowpal Wabbit word baseline,
2. a richer VW model with punctuation and character n-grams,
3. a tuned TF-IDF ensemble,
4. a stacked sparse-text ensemble using out-of-fold predictions,
5. a small representation survey comparing sparse features, BM25, Word2Vec, and FastText.
The goal was not only to improve the score, but also to understand which representations helped, which metrics improved, and which evaluation setup each result came from.
> This article focuses on the project’s methodology, results, and interpretation. I’ll go over the main implementation choices and share the key code snippets, but I won’t include every line from the notebook. The complete executed notebook, including the full implementation and outputs, is available in the GitHub repository linked at the end.
## Dataset and Evaluation Setup
The dataset contains **19,579 labeled training sentences** and **8,392 unlabeled test sentences**. The class distribution is mildly imbalanced:
**Figure 1. Class distribution in the training set.** The dataset is mildly imbalanced, with EAP making up the largest share of examples and HPL the smallest.
I encoded the labels as 1-based integers because Vowpal Wabbit’s One-Against-All multiclass mode expects labels starting at 1.
```python
import pandas as pd

# DATA_DIR is assumed to be a pathlib.Path pointing at the Kaggle CSV files.
train_texts = pd.read_csv(DATA_DIR / "train.csv", index_col="id")
test_texts = pd.read_csv(DATA_DIR / "test.csv", index_col="id")

# VW's One-Against-All mode expects 1-based integer labels.
AUTHOR_CODE = {"EAP": 1, "MWS": 2, "HPL": 3}
train_texts["author_code"] = train_texts["author"].map(AUTHOR_CODE)

print(f"Train: {len(train_texts)} sentences  Test: {len(test_texts)} sentences")
print(train_texts["author"].value_counts(normalize=True).round(3))
```
To compare models locally, I used a single stratified 70/30 train-validation split with a fixed random seed. This kept the class proportions stable and ensured that every model was evaluated on the same held-out examples.
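```python
from sklearn.model_selection import train_test_split

# Stratify on the label so both parts keep the original class proportions.
train_texts_part, valid_texts = train_test_split(
    train_texts,
    test_size=0.3,
    random_state=17,
    stratify=train_texts["author_code"],
)
y_part = train_texts_part["author_code"].values
y_valid = valid_texts["author_code"].values
```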
I focused on three main metrics:
- **Accuracy:** straightforward to understand, but it only measures the final top-class decision.
- **Macro-F1:** useful for checking whether performance is balanced across the three authors.
- **Multiclass log loss:** the official Kaggle metric and the most important metric for this project, because it evaluates the quality of the predicted probabilities, not just the predicted class.
Log loss rewards confident correct predictions and heavily penalizes confident wrong predictions. This matters in a competition where the submission is a probability distribution over EAP, HPL, and MWS.
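To make that penalty concrete, here is a minimal sketch of multiclass log loss in plain Python. The probability rows are made-up values for illustration, not outputs from any model in this project, and the column indexing is an assumption chosen for the example.

```python
import math

def multiclass_log_loss(y_true, probas, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probas):
        # Clip to avoid log(0) when a model outputs an exact zero.
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Columns are indexed 0=EAP, 1=HPL, 2=MWS; the true author is EAP in every row.
y_true = [0, 0, 0]
confident_right = [[0.90, 0.05, 0.05]] * 3
hesitant_right = [[0.50, 0.30, 0.20]] * 3
confident_wrong = [[0.02, 0.97, 0.01]] * 3

for name, probas in [("confident right", confident_right),
                     ("hesitant right", hesitant_right),
                     ("confident wrong", confident_wrong)]:
    print(f"{name}: {multiclass_log_loss(y_true, probas):.3f}")
```

The first two toy models have identical accuracy, since both pick EAP every time, yet their log loss differs. That gap is what accuracy alone cannot show.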
## 1. Word-only Vowpal Wabbit Baseline
I started with Vowpal Wabbit because it is fast, handles sparse data well, and is well-suited to linear text models. VW trains online linear models, hashes features into a fixed feature space, and handles multiclass classification through One-Against-All.
For the first baseline, I used only lowercased word features of length three or more.
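```python
import re

def to_vw_words(df, is_train=True):
    """Build one VW line per row in the form '{label} |text {words}'."""
    lines = []
    for i in range(len(df)):
        # Test rows get a placeholder label of 1; it is not used for prediction.
        label = df["author_code"].iloc[i] if is_train else 1
        # Drop '|' and ':' because both are reserved in VW's input format.
        text = df["text"].iloc[i].lower().replace("|", "").replace(":", "")
        # Keep only word tokens with three or more characters.
        words = " ".join(re.findall(r"\w{3,}", text))
        lines.append(f"{label} |text {words}\n")
    return lines
```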
One implementation detail that mattered was how VW handles multiple passes. When VW reads a file directly, options such as `passes` and `cache` behave as expected. When feeding examples manually through the Python API, I had to loop over the file myself.
```python
from vowpalwabbit import Workspace

N_PASSES = 10
vw = Workspace(
    oaa=3,
    loss_function="logistic",
    ngram=2,
    b=28,
    quiet=True,
    final_regressor=f"{OUTPUT_DIR}/spooky_words.vw",
)
# Through the Python API, multiple passes mean re-reading the file manually.
for _ in range(N_PASSES):
    with open(f"{OUTPUT_DIR}/train_words.vw") as f:
        for line in f:
            vw.learn(line)
vw.finish()
```
On the 70/30 holdout split, the word-only VW baseline reached:
**Holdout performance of the word-only Vowpal Wabbit baseline.** Even with simple word and bigram features, the fast linear VW model provides a strong starting point.
This was already a strong result for a fast linear model using simple word and bigram features. It also established a useful baseline: any added representation or ensemble layer needed to clear this bar.
## 2. Rich VW: Adding Style-Aware Features
The word-only baseline showed that words and bigrams already carry some stylistic signal, and adding punctuation and character n-grams improved the model further. The revised feature set included lowercased words (including short function words), punctuation marks, and character n-grams that span word boundaries. This richer representation captured each author's stylistic fingerprints more effectively and improved the local validation score.
## Building Rich Feature Representations for Authorship Attribution
Authorship attribution goes beyond simple topic classification. A model needs access to cues that reflect writing style, not just word choice. This section covers two ways to capture stylistic signals: a multi-namespace Vowpal Wabbit setup and a TF-IDF pipeline with multiple linear models.
### A Richer Vowpal Wabbit Model
For a richer VW model, the input was separated into three distinct namespaces, each capturing a different layer of textual information:
- `|w` for words, including short function words
- `|p` for punctuation
- `|c` for character n-grams
The character n-gram extraction was boundary-aware, meaning whitespace and text edges were replaced with underscores to preserve positional information:
```python
import re

def char_ngrams(text, ns=(2, 3, 4)):
    """Boundary-aware character n-grams; whitespace/edges become '_'."""
    t = "_" + re.sub(r"\s+", "_", text.strip()) + "_"
    return [t[i:i + n] for n in ns for i in range(len(t) - n + 1)]

def to_vw_rich(df, is_train=True, char_ns=(2, 3, 4)):
    """Three namespaces: |w words, |p punctuation, |c character n-grams."""
    lines = []
    texts = df["text"].values
    labels = df["author_code"].values if is_train else None
    for i, text in enumerate(texts):
        # Replace VW's reserved characters so they cannot break the line format.
        safe = str(text).lower().replace("|", " ").replace(":", " ")
        label = labels[i] if is_train else 1
        words = " ".join(re.findall(r"\w+", safe))
        punct = " ".join(re.findall(r"[^\w\s]", safe))
        chars = " ".join(char_ngrams(safe, ns=char_ns))
        lines.append(f"{label} |w {words} |p {punct} |c {chars}\n")
    return lines
```
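As a quick illustration of what the three namespaces contain, the sketch below runs the same extraction steps on a toy string. The input is invented for the example and the character n-grams are limited to bigrams to keep the output short.

```python
import re

def char_ngrams(text, ns=(2, 3, 4)):
    """Boundary-aware character n-grams; whitespace/edges become '_'."""
    t = "_" + re.sub(r"\s+", "_", text.strip()) + "_"
    return [t[i:i + n] for n in ns for i in range(len(t) - n + 1)]

sample = "the end."  # toy input, not taken from the dataset

words = re.findall(r"\w+", sample)       # content of the |w namespace
punct = re.findall(r"[^\w\s]", sample)   # content of the |p namespace
bigrams = char_ngrams(sample, ns=(2,))   # content of the |c namespace

w_part = " ".join(words)
p_part = " ".join(punct)
c_part = " ".join(bigrams)
vw_line = f"1 |w {w_part} |p {p_part} |c {c_part}"
print(vw_line)
```

The underscore-marked bigrams such as `e_` and `_e` record that one word ends in "e" and the next begins with "e", which plain word tokens cannot express.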
This model used more training passes and a slightly larger hash space than the word-only baseline:
```python
N_PASSES = 15
vw = Workspace(
    oaa=3,
    loss_function="logistic",
    ngram=2,
    b=29,
    quiet=True,
    final_regressor=f"{OUTPUT_DIR}/spooky_rich.vw",
)
for _ in range(N_PASSES):
    with open(f"{OUTPUT_DIR}/train_rich.vw") as f:
        for line in f:
            vw.learn(line)
vw.finish()
```
The improvement was meaningful. Adding punctuation and character-level structure helped the model capture style beyond plain word choice, improving both accuracy and Macro-F1 over the word-only VW baseline.
### TF-IDF Word and Character Features
To determine whether another classical sparse-text pipeline could match or exceed the VW results, a TF-IDF feature matrix was built using two complementary views of the text:
1. Word-level unigrams and bigrams
2. Character-level 2-to-5-grams inside word boundaries
```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

CLASSES = np.array([1, 2, 3])  # 1=EAP, 2=MWS, 3=HPL

def build_tfidf(fit_texts):
    """Fit the word-level and character-level vectorizers on the same texts."""
    word_vectorizer = TfidfVectorizer(
        sublinear_tf=True,
        ngram_range=(1, 2),
        min_df=2,
    ).fit(fit_texts)
    char_vectorizer = TfidfVectorizer(
        sublinear_tf=True,
        analyzer="char_wb",
        ngram_range=(2, 5),
        min_df=2,
    ).fit(fit_texts)
    return word_vectorizer, char_vectorizer

def tfidf_features(word_vectorizer, char_vectorizer, texts):
    """Stack word and character TF-IDF blocks side by side as one CSR matrix."""
    X_word = word_vectorizer.transform(texts)
    X_char = char_vectorizer.transform(texts)
    return sp.hstack([X_word, X_char]).tocsr()
```
The word features capture vocabulary and phrase-level evidence, while the character features capture spelling fragments, suffixes, prefixes, punctuation-adjacent patterns, and other small details useful for style classification.
Three complementary models were trained on this representation: Logistic Regression, NB-SVM-style Logistic Regression, and Complement Naive Bayes. For Logistic Regression and the NB-SVM-style model, the `C` values were tuned with inner cross-validation on the training split only, leaving the holdout set untouched:
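```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

def tune_lr_C(X, y, C_grid=(0.1, 0.3, 1, 3, 10, 30), n_splits=5):
    """Score each C by out-of-fold log loss on the training split only."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    rows = []
    for C in C_grid:
        oof = np.zeros((X.shape[0], len(CLASSES)))
        for tr_idx, va_idx in cv.split(X, y):
            clf = LogisticRegression(C=C, max_iter=3000)
            clf.fit(X[tr_idx], y[tr_idx])
            # align_proba is a notebook helper that is not shown in this article.
            oof[va_idx] = align_proba(clf, X[va_idx])
        rows.append({"C": C, "log_loss": log_loss(y, oof, labels=CLASSES)})
    return pd.DataFrame(rows)
```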
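The tuning code calls `align_proba` without showing it. One plausible version is sketched below, assuming the helper only reorders `predict_proba` columns to match `CLASSES` and fills a zero column for any class the fitted model never saw. Both the function body and `ToyModel` are hypothetical stand-ins written for this example, not the notebook's actual code.

```python
import numpy as np

CLASSES = np.array([1, 2, 3])  # 1=EAP, 2=MWS, 3=HPL

def align_proba(clf, X, classes=CLASSES):
    """Return predict_proba output with one column per entry in `classes`."""
    proba = clf.predict_proba(X)
    aligned = np.zeros((proba.shape[0], len(classes)))
    for col, cls in enumerate(clf.classes_):
        # Place each model column at the position of its label in `classes`.
        target = int(np.where(classes == cls)[0][0])
        aligned[:, target] = proba[:, col]
    return aligned

class ToyModel:
    """Hypothetical stand-in for a fitted classifier that never saw class 2."""
    classes_ = np.array([1, 3])

    def predict_proba(self, X):
        return np.tile([0.8, 0.2], (len(X), 1))

aligned = align_proba(ToyModel(), [[0], [0]])
print(aligned)
```

A helper like this matters mostly inside cross-validation loops, where the column order of each fold's model has to match the `labels` argument passed to `log_loss`.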
Inner cross-validation showed that the NB-SVM-style Logistic Regression reached a lower inner-CV log loss than plain Logistic Regression, suggesting a stronger tuned linear component. The final three-model probability average produced strong accuracy and a competitive log loss on the 70/30 holdout split. The accuracy gain over the rich VW model was modest, but the log loss was notably strong, which matters because Kaggle scores the predicted probability distributions rather than the top-class decisions.
### NB-SVM-style Logistic Regression
The NB-SVM-style model deserves special attention as a simple yet effective classical text-classification technique. The idea is to compute a per-feature log-count ratio — how much more often a feature appears in one class than in the others — and then multiply each feature by this ratio before fitting a linear classifier:
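```python
def nbsvm_proba(X_train, y_train, X_test, C=10):
    """One-vs-rest Logistic Regression on Naive-Bayes-weighted features."""
    probas = []
    for cls in CLASSES:
        y_binary = (y_train == cls).astype(int)
        # Smoothed feature totals inside (p) and outside (q) the class.
        p = X_train[y_binary == 1].sum(axis=0) + 1
        q = X_train[y_binary == 0].sum(axis=0) + 1
        # Log-count ratio: positive when a feature favors this class.
        r = np.log((p / p.sum()) / (q / q.sum()))
        r = np.asarray(r).ravel()
        clf = LogisticRegression(C=C, max_iter=3000)
        clf.fit(X_train.multiply(r).tocsr(), y_binary)
        probas.append(clf.predict_proba(X_test.multiply(r).tocsr())[:, 1])
    # Combine the one-vs-rest scores and renormalize each row to sum to 1.
    proba = np.vstack(probas).T
    proba = np.clip(proba, 1e-15, 1 - 1e-15)
    return proba / proba.sum(axis=1, keepdims=True)
```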
Despite the name, this implementation is not a pure SVM. It uses Logistic Regression trained on Naive-Bayes-weighted sparse features. The key benefit is that features strongly associated with a specific author are amplified before the linear model is trained, giving the classifier a head start in identifying discriminative stylistic patterns.
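To see how the log-count ratio behaves, here is a small worked example in plain Python. The feature names and counts are invented for illustration. The smoothing constant of 1 matches the `+ 1` in the function above.

```python
import math

# Toy feature totals for one class ("in") versus the other two ("out").
features = ["upon", "eldritch", "the"]
count_in = [30, 0, 200]
count_out = [10, 25, 410]

def log_count_ratio(count_in, count_out, alpha=1):
    """Smoothed log-count ratio r = log((p / sum(p)) / (q / sum(q)))."""
    p = [c + alpha for c in count_in]
    q = [c + alpha for c in count_out]
    p_total, q_total = sum(p), sum(q)
    return [math.log((pi / p_total) / (qi / q_total)) for pi, qi in zip(p, q)]

r = log_count_ratio(count_in, count_out)
for name, value in zip(features, r):
    print(f"{name}: {value:+.2f}")
```

A feature that is relatively more frequent inside the class gets a positive ratio, one that is rarer gets a negative ratio, and a word that is common everywhere lands near zero. Multiplying the TF-IDF columns by these ratios therefore shrinks uninformative features before the linear model is fit.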
---
*This article was based on the original post: [Authorship Attribution with Vowpal Wabbit and TF-IDF Features](https://contributor.insightmediagroup.io/)*
# Stacking with Out-of-Fold Predictions and Final Submission Strategy
After building a strong TF-IDF ensemble, the next logical step was to combine multiple base models in a way that could learn optimal weighting rather than relying on a simple flat average. A flat average treats every model equally, but there is no reason to assume every model is equally reliable for every class. Stacking addresses this by training a second-level model — a meta-learner — that learns how to best combine the base model predictions.
## Avoiding Leakage with Out-of-Fold Predictions
The primary risk when building a stacked ensemble is data leakage. If the meta-learner is trained on predictions from base models that have already seen the same training examples, it will overfit and produce overly optimistic estimates. To prevent this, out-of-fold (OOF) predictions were used.
The process works as follows:
- **For training examples**, each base model predicts only the examples in a fold that it was not trained on. This ensures that every training example receives a prediction from a version of the model that never saw it during training.
- **For holdout or test examples**, predictions are averaged across all fold-trained versions of each base model. This reduces variance and produces more stable probability estimates.
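The two rules above can be sketched with a deliberately trivial setup in plain Python. The "model" here is only the class frequencies of its training rows, and the folds are assigned by index rather than stratified. Both are simplifications made for this illustration and differ from the project's actual base models and `StratifiedKFold` splits.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs; row i belongs to fold i % n_folds."""
    for fold in range(n_folds):
        valid = [i for i in range(n_rows) if i % n_folds == fold]
        train = [i for i in range(n_rows) if i % n_folds != fold]
        yield train, valid

def fit_prior(labels, n_classes):
    """Toy 'model': the class frequencies of its training rows."""
    counts = [0] * n_classes
    for label in labels:
        counts[label] += 1
    return [c / len(labels) for c in counts]

y = [0, 1, 2, 0, 1, 2, 0, 0, 1, 0]  # toy labels
n_classes, n_folds = 3, 5
oof = [None] * len(y)
test_pred = [0.0] * n_classes

for train_idx, valid_idx in kfold_indices(len(y), n_folds):
    model = fit_prior([y[i] for i in train_idx], n_classes)
    for i in valid_idx:
        # Row i is scored by a model that never saw row i.
        oof[i] = model
    # Test predictions are averaged over the fold-trained models.
    test_pred = [t + m / n_folds for t, m in zip(test_pred, model)]

print(oof[0], test_pred)
```

Every training row ends up with exactly one prediction, and that prediction comes from the fold in which the row was held out, which is what keeps the meta-learner's inputs free of leakage.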
## Base Models and Hyperparameter Grids
Five base models were selected for the stacking ensemble, each with its own hyperparameter search grid:
```python
BASE_MODELS = ["lr", "nbsvm", "cnb", "mnb", "sgd"]
BASE_PARAM_GRIDS = {
    "lr": {"C": [1, 3, 10, 30]},
    "nbsvm": {"C": [1, 3, 10, 30]},
    "cnb": {"alpha": [0.1, 0.3, 0.5, 1.0]},
    "mnb": {"alpha": [0.1, 0.3, 0.5, 1.0]},
    "sgd": {"alpha": [1e-6, 3e-6, 1e-5, 3e-5]},
}
```
These models span a range of approaches — from Logistic Regression and Naive Bayes SVM variants to Complement Naive Bayes, Multinomial Naive Bayes, and SGD-based classifiers — providing the meta-learner with diverse probability signals to combine.
## Building the Stacking Feature Matrix
The stacking feature builder creates a matrix with one block of probability columns per base model. With five base models and three author classes, the meta-learner receives 15 probability features per example. The implementation uses stratified k-fold cross-validation to generate both OOF training features and averaged test features:
```python
def build_stack_features(X_train, y_train, X_test, best_params_by_model,
                         n_folds=5, seed=17):
    """Return OOF training features and fold-averaged test features."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    n_classes = len(CLASSES)
    n_models = len(BASE_MODELS)
    oof_stack = np.zeros((X_train.shape[0], n_classes * n_models))
    test_stack = np.zeros((X_test.shape[0], n_classes * n_models))
    for j, kind in enumerate(BASE_MODELS):
        # Each base model owns one block of n_classes probability columns.
        start = j * n_classes
        end = start + n_classes
        params = best_params_by_model[kind]
        for tr_idx, va_idx in skf.split(X_train, y_train):
            # OOF: predict only the rows this fold's model was not trained on.
            oof_stack[va_idx, start:end] = base_proba(
                kind,
                X_train[tr_idx],
                y_train[tr_idx],
                X_train[va_idx],
                params,
            )
            # Test: accumulate the average over the fold-trained models.
            test_stack[:, start:end] += base_proba(
                kind,
                X_train[tr_idx],
                y_train[tr_idx],
                X_test,
                params,
            ) / n_folds
    return oof_stack, test_stack
```
## Tuning the Meta-Learner
A Logistic Regression model was chosen as the meta-learner, and its regularization parameter `C` was tuned using cross-validation on the stacked probability features:
```python
def tune_meta_C(oof_stack, y, C_grid=(0.03, 0.1, 0.3, 1, 3, 10, 30)):
    """Print the out-of-fold log loss of the meta-learner for each C."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
    for C in C_grid:
        oof_meta = np.zeros((oof_stack.shape[0], len(CLASSES)))
        for tr_idx, va_idx in skf.split(oof_stack, y):
            meta = LogisticRegression(C=C, max_iter=3000)
            meta.fit(oof_stack[tr_idx], y[tr_idx])
            oof_meta[va_idx] = align_proba(meta, oof_stack[va_idx])
        print(C, log_loss(y, oof_meta, labels=CLASSES))
```
## Holdout Results
On the 70/30 holdout split, the best base-model hyperparameters were recorded, and the best meta-learner setting was `C=3`. The stacked model achieved the lowest holdout log loss among all classical pipelines tested in the project.
The most significant improvement was not in raw accuracy but in log loss — meaning the ensemble improved the quality of the probability estimates themselves. Since the Kaggle competition metric rewards well-calibrated probabilities, this was exactly the kind of improvement that mattered.
## Final Full-Data Refit and Kaggle Submission
For the final submission, the entire pipeline was refit on the full labeled training data. This involved:
1. Refitting the TF-IDF representation on all training data
2. Rebuilding the stacking features with the expanded dataset
3. Retuning the base models on the full data
4. Training the final meta-learner
5. Generating predictions for the test set
On the full training data, the best base-model parameters shifted slightly compared to the 70/30 holdout, and the best meta-learner setting was `C=30`. The code also explicitly mapped the internal class order `[1, 2, 3] = [EAP, MWS, HPL]` into Kaggle’s required submission column order: `EAP`, `HPL`, `MWS`.
```python
meta_final = LogisticRegression(C=best_full_meta_C, max_iter=3000)
meta_final.fit(oof_full, y_full)
proba_test = align_proba(meta_final, test_stack)
# Clip and renormalize so every row is a valid probability distribution.
proba_test = np.clip(proba_test, 1e-15, 1 - 1e-15)
proba_test = proba_test / proba_test.sum(axis=1, keepdims=True)
# Internal column order is [EAP, MWS, HPL]; Kaggle expects EAP, HPL, MWS.
submission = pd.DataFrame({
    "id": test_texts.index,
    "EAP": proba_test[:, 0],  # class 1
    "HPL": proba_test[:, 2],  # class 3
    "MWS": proba_test[:, 1],  # class 2
})
submission.to_csv(OUTPUT_DIR / "spooky_submission.csv", index=False)
```
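Because a swapped column would silently ruin the score, it can help to do the reordering by author name instead of by position. The sketch below is one way to do that in plain Python. The function name and the example row id are hypothetical and do not come from the notebook.

```python
INTERNAL_ORDER = ["EAP", "MWS", "HPL"]    # classes 1, 2, 3 inside the pipeline
SUBMISSION_ORDER = ["EAP", "HPL", "MWS"]  # column order required by Kaggle

def to_submission_row(row_id, proba):
    """Map one probability row from internal order to submission order."""
    if abs(sum(proba) - 1.0) > 1e-6:
        raise ValueError(f"probabilities for {row_id} do not sum to 1")
    by_name = dict(zip(INTERNAL_ORDER, proba))
    row = {"id": row_id}
    for name in SUBMISSION_ORDER:
        row[name] = by_name[name]
    return row

# Hypothetical id and probabilities, in internal [EAP, MWS, HPL] order.
row = to_submission_row("example_id", [0.7, 0.2, 0.1])
print(row)
```

Looking columns up by name makes the mapping explicit, so a later change to the internal label encoding cannot shift probabilities into the wrong author column unnoticed.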
The full-data level-2 OOF estimate for the meta-learner served as a useful sanity check, but it is not directly comparable to the earlier 70/30 holdout results because it comes from a different evaluation setup: it evaluates the meta-learner on out-of-fold stacking features over the full training data and is not a fully nested cross-validation of the entire pipeline.
On the Kaggle leaderboard, the final stacked model scored a private log loss of 0.30414 and a public log loss of 0.33621, which is consistent with stacking on out-of-fold predictions being the most effective of the classical pipelines explored in the project.
---
*Original article: [Stacking with Out-of-Fold Predictions and Final Submission](https://contributor.insightmediagroup.io/)*
# Authorship Attribution with Stacked Models: A Detailed Walkthrough
## 1. Validation Setup and Reliability
A critical part of any machine learning pipeline is ensuring that the validation setup is trustworthy. In this project, the private leaderboard score landed close to the full-data level-2 out-of-fold (OOF) estimate, which is encouraging. However, this should be treated as validation evidence rather than proof that the setup is fully unbiased.
## 2. Error Analysis
Aggregate metrics are useful, but they can hide where the model fails. Using the holdout predictions from the stacked model, the confusion matrix, per-author recall, and high-confidence mistakes were inspected.
The confusion matrix for the tuned stacked model on the 70/30 holdout split showed that most predictions fall on the diagonal, while the largest off-diagonal errors come from confusion between MWS (Mary Shelley) and EAP (Edgar Allan Poe).
Per-author recall was relatively balanced across all three authors, suggesting that the model does not rely heavily on a single majority class. The most common misclassification pairs confirmed that the largest errors occur between MWS and EAP, followed by HPL (H.P. Lovecraft) and EAP, showing that the remaining mistakes are mostly between stylistically overlapping authors.
The main takeaway is that the model did not simply collapse into predicting the largest class. The recall scores were close across all three authors, and the errors were bidirectional. MWS and EAP were often confused with each other, while HPL and EAP also overlapped on some short or stylistically neutral sentences.
High-confidence mistakes were also inspected. One notable example was the sentence:
> *"I walked the cellar from end to end."*
The true author was EAP, but the model assigned HPL a probability above 0.97. This is a useful reminder that single-sentence authorship can be underdetermined. Some sentences simply do not carry enough distinctive stylistic evidence for a sparse linear model to separate three similar gothic authors reliably.
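The three checks described in this section can be sketched in plain Python as follows. The labels and probability rows are made-up values chosen to include one confident mistake. They are not the project's holdout predictions, and the 0.9 threshold is an arbitrary choice for the example.

```python
AUTHORS = ["EAP", "MWS", "HPL"]

def argmax(row):
    """Index of the largest probability in a row."""
    return max(range(len(row)), key=lambda i: row[i])

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Share of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

def high_confidence_mistakes(y_true, probas, threshold=0.9):
    """Wrong predictions whose top probability reaches the threshold."""
    found = []
    for i, (t, row) in enumerate(zip(y_true, probas)):
        pred = argmax(row)
        if pred != t and row[pred] >= threshold:
            found.append((i, AUTHORS[t], AUTHORS[pred], row[pred]))
    return found

y_true = [0, 0, 1, 2, 1]
probas = [
    [0.80, 0.15, 0.05],  # EAP, predicted EAP
    [0.02, 0.01, 0.97],  # EAP, confidently predicted HPL
    [0.30, 0.60, 0.10],  # MWS, predicted MWS
    [0.10, 0.10, 0.80],  # HPL, predicted HPL
    [0.55, 0.40, 0.05],  # MWS, predicted EAP with low confidence
]
y_pred = [argmax(row) for row in probas]
matrix = confusion_matrix(y_true, y_pred, n_classes=3)
recall = per_class_recall(matrix)
mistakes = high_confidence_mistakes(y_true, probas)
print(matrix, recall, mistakes)
```

Sorting the high-confidence mistakes by probability is a cheap way to surface sentences like the cellar example, where the model is certain and wrong.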
## 3. A Representation Survey
To put the main pipeline in context, several foundational representations were tested on the same holdout split.
**Bag-of-Words** used word counts with unigrams and bigrams. **BM25** was treated as a nearest-neighbor classifier — not its natural use case, but useful as a point of comparison. **Word2Vec and FastText** embeddings were trained on the training split, with each sentence represented as an IDF-weighted average of its word vectors.
The results showed that sparse count-based features performed better than BM25 retrieval and simple averaged Word2Vec/FastText embeddings on this short-text authorship task. This does not mean Word2Vec or FastText are generally weak — it means that for this short-text authorship task, averaging word vectors blurred many of the stylistic details that sparse word, character, and punctuation features preserved.
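A small sketch helps show why averaging loses stylistic detail. It builds an IDF-weighted sentence vector in plain Python from hand-written two-dimensional "embeddings". The vectors, the tiny corpus, and the smoothed IDF formula are all assumptions made for this illustration, since the article does not state which IDF variant the notebook used.

```python
import math

def idf_weights(tokenized_docs):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for token in set(doc):
            df[token] = df.get(token, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

def sentence_vector(tokens, vectors, idf, dim):
    """IDF-weighted average of the word vectors of known tokens."""
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token in vectors and token in idf:
            w = idf[token]
            total = [t + w * v for t, v in zip(total, vectors[token])]
            weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total

docs = [["the", "raven"], ["the", "monster"], ["the", "cellar"]]
vectors = {"the": [1.0, 0.0], "raven": [0.0, 1.0]}  # toy embeddings
idf = idf_weights(docs)

forward = sentence_vector(["the", "raven"], vectors, idf, dim=2)
reverse = sentence_vector(["raven", "the"], vectors, idf, dim=2)
print(forward, reverse)
```

The rarer word dominates the average, but word order has no effect on the result and punctuation never enters it, which is the kind of information the sparse word, character, and punctuation features keep.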
## 4. Results at a Glance
All holdout rows use the same stratified 70/30 split, so they are directly comparable. The full-data level-2 OOF estimate is included as a separate sanity check for the final stacked model, though it is not directly comparable to the holdout rows because it uses a different evaluation setup.
The final Kaggle submission achieved a **private log loss of 0.30414** and a **public log loss of 0.33621**.
## 5. What Actually Helped
Most of the useful improvements came from better representations and cleaner validation, not from adding complexity for its own sake.
**Sparse word and character features carried the strongest signal.** The task is stylistic, and sparse n-gram features captured the subtle differences between the three gothic authors far more effectively than dense averaged embeddings.
---
*Original article source: [Insight Media Group Contributor](https://contributor.insightmediagroup.io)*
# Classical NLP for Authorship Attribution: A Spooky Author Identification Walkthrough
## Introduction
Authorship attribution is one of those NLP tasks where the smallest linguistic fingerprints — punctuation habits, function word preferences, subword patterns — can carry an enormous amount of signal. This project explores how far classical, sparse-feature NLP can go on Kaggle’s *Spooky Author Identification* dataset, built from public-domain fiction excerpts by Edgar Allan Poe (EAP), H.P. Lovecraft (HPL), and Mary Wollstonecraft Shelley (MWS). The goal is straightforward: given a short sentence excerpt, predict which of the three authors wrote it.
The notable finding was how far the classical approach went without any transformer model. With the right combination of feature engineering, probability-sensitive tuning, and stacked generalization, a sparse-feature pipeline proved remarkably competitive.
---
## The Dataset
The Spooky Author Identification dataset contains short text excerpts tagged with one of three author labels — EAP, HPL, or MWS. Each sample is a sentence or partial sentence drawn from the authors’ fiction. The task is a standard three-class text classification problem optimized on log loss, making probability calibration as important as raw accuracy.
Because the texts are short literary sentences, the stylistic and lexical surface features tend to carry more signal than deep semantic meaning, making this a natural fit for classical sparse representations.
---
## Key Findings
### Dense Vector Pooling Tends to Smooth Away Detail
One early observation was that pooled dense vector representations, which compress a sentence into a single embedding, tended to smooth away fine-grained stylistic details that are useful for authorship attribution. Subtle cues such as punctuation patterns and character n-gram habits are precisely the kind of features that pooled dense vectors dilute in exchange for generalization.
### Punctuation and Character N-grams Improved Authorship Modeling
Adding style-aware features, including punctuation-based features and character n-grams, proved to be a meaningful upgrade. Incorporating these features increased the Vowpal Wabbit (VW) holdout accuracy from **0.8332 to 0.8553**. This underscores how much authorship signal lives at the surface level — in how authors structure their sentences, not just which words they choose.
### TF-IDF Improved Probability Quality
The tuned TF-IDF ensemble did not produce a dramatic jump in accuracy, but it delivered a strong log loss result. Since the Kaggle competition optimizes for log loss rather than accuracy, this outcome is especially meaningful. A model that produces well-calibrated probability estimates will score better under log loss, even if its top-1 accuracy improvement is modest.
### Stacking Helped Most with Log Loss
The stacked model improved holdout log loss from **0.3843 to 0.3504**. This gain indicates that the meta-learner found a better way to combine probability estimates across base models than a simple flat average could achieve. Stacking the outputs of multiple models with complementary strengths yielded a probability ensemble that was more reliable overall.
### Evaluation Separation Matters
The project maintained a strict separation between three evaluation benchmarks: the 70/30 holdout, the full-data level-2 out-of-fold (OOF) estimate, and the Kaggle leaderboard scores. Each of these answers a different question — internal model selection, full-data generalization, and public benchmark performance — and conflating them would have made the results appear more certain than they actually were.
---
## Final Results
The strongest classical pipeline reached **0.8687 accuracy** and **0.3504 log loss** on the 70/30 holdout split. The final stacked submission scored **0.30414 private** and **0.33621 public** log loss on Kaggle — a very competitive result on a leaderboard optimized for probability quality.
---
## Main Takeaway
The central lesson of this project is not simply that stacking improved the score — it is that authorship attribution rewards attention to detail. Punctuation, subword patterns, function words, and carefully tuned probability estimates all contribute meaningfully. Before reaching for heavier contextual models like fine-tuned transformers, a well-validated sparse-text baseline can still be a serious competitor, especially on short-text domains where stylistic surface features carry significant signal.
---
## Limitations and Next Steps
Several improvements could extend this work:
1. **Nested cross-validation** — The stacking pipeline was evaluated with a single holdout split plus a full-data level-2 OOF estimate. A fully nested cross-validation design would provide a more conservative estimate of the entire modeling and tuning process.
2. **Calibration diagnostics** — While log loss served as the main probability-quality metric, explicit calibration diagnostics such as reliability diagrams or expected calibration error were not included. Since the final objective is probability quality, calibration analysis is a natural next step.
3. **Transformer baselines** — No comparison was made against transformer models such as DistilBERT or BERT. A fine-tuned transformer would be the obvious next benchmark, particularly to evaluate how much contextual representation improves over sparse classical features on short literary sentences.
4. **Broader hyperparameter search** — The search was intentionally limited. Expanding the space over TF-IDF ranges, VW settings, smoothing values, regularization strengths, and stacking design choices could improve the final score.
5. **Domain specificity** — The dataset is small and domain-specific. Conclusions drawn here apply to short-text authorship attribution in this setting, not to a universal ranking of NLP methods.
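As a starting point for the calibration diagnostics mentioned in item 2, expected calibration error can be computed in a few lines of plain Python. This is a minimal sketch of the common top-label, equal-width-bin variant. The bin count and the toy inputs are assumptions for the example, and nothing here was run on the project's predictions.

```python
def expected_calibration_error(y_true, probas, n_bins=10):
    """Top-label ECE: bin-weighted gap between mean confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for label, row in zip(y_true, probas):
        pred = max(range(len(row)), key=lambda i: row[i])
        conf = row[pred]
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, pred == label))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        ece += len(members) / n * abs(mean_conf - accuracy)
    return ece

# Toy case 1: 75% confidence and 3 of 4 correct, so confidence matches accuracy.
calibrated = expected_calibration_error(
    [0, 0, 0, 1], [[0.75, 0.25, 0.0]] * 4
)
# Toy case 2: full confidence and always wrong.
overconfident = expected_calibration_error(
    [1, 1], [[1.0, 0.0, 0.0]] * 2
)
print(calibrated, overconfident)
```

The per-bin confidence and accuracy pairs computed inside the loop are the same quantities a reliability diagram would plot.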
---
## Data Source and License
This article uses Kaggle’s *Spooky Author Identification* dataset, a text classification dataset built from excerpts of public-domain fiction by Edgar Allan Poe, H.P. Lovecraft, and Mary Wollstonecraft Shelley. The task is to predict the author of each sentence among three labels: **EAP** for Edgar Allan Poe, **HPL** for H.P. Lovecraft, and **MWS** for Mary Wollstonecraft Shelley.
The dataset is listed on Kaggle under the CC BY 4.0 license, which permits sharing and adaptation, including for commercial purposes, provided appropriate attribution is given.



