Imagine a language model that’s never encountered the internet, smartphones, or even World War II. That’s not just a thought experiment—it’s exactly what a team of researchers led by Nick Levine, David Duvenaud, and Alec Radford has created. They call it talkie, and it might be the most historically faithful large language model ever released to the public.
Talkie is a 13-billion-parameter open-weight language model trained exclusively on English text published before 1931. Developed by a non-profit team, it introduces what the researchers describe as a “vintage language model”—an LM whose knowledge cutoff isn’t defined by when it was trained, but by a precise moment in history.
What Exactly Is a Vintage Language Model?
To grasp talkie, you first need to understand its core idea. Most modern LLMs—like GPT-4, LLaMA, Mistral, and others—are trained on massive datasets scraped from today’s internet. Their knowledge reflects the world as it is now, up to their training cutoffs. A vintage language model reverses this approach: it’s intentionally trained only on historical data so its “worldview” is frozen at a specific point in the past.
For talkie, that boundary is December 31, 1930. The choice is deliberate: works published before 1931 have since entered the public domain in the United States, which makes them legally usable for training.
The model, officially named talkie-1930-13b-base, was trained on 260 billion tokens of pre-1931 English text, including books, newspapers, magazines, scientific journals, patents, and legal case records. A separate instruction-tuned version called talkie-1930-13b-it is also available for interactive conversations. The team has launched a live 24/7 demo at talkie-lm.com/chat, where Claude Sonnet 4.6 continuously prompts the instruction-tuned model, letting visitors observe talkie’s voice and knowledge in real time.
Why Build a Model From 1930?
This isn’t just a nostalgia project. The research team has identified several concrete, technically valuable use cases that make talkie especially interesting to the AI research community.
1. Contamination-free generalization tests: Benchmark contamination—where test data accidentally appears in training data—is one of the most persistent and underappreciated issues in modern LLM evaluation. Since talkie was trained solely on pre-1931 text, it’s inherently free from contamination with respect to any modern benchmark. This provides a clean experimental setup to study how well an LM can generalize beyond its training data. For example, the team tested whether talkie could learn Python—a language that didn’t exist in 1930—using just a few in-context examples. On the HumanEval benchmark, they found that while vintage models fall far behind internet-trained ones, they are “slowly but steadily improving at this task with scale.”
2. Studying how surprising future events feel to old models: Inspired by Calcifer Computing’s work on Temporal Language Models, the team used talkie to measure the surprisal (in bits per byte) of historical event summaries from the New York Times’ “On This Day” column. Events after 1930—the model’s cutoff—are consistently more surprising to talkie, with the effect strongest for 1950s and 1960s events, then leveling off. This sets up a controlled way to study how forecasting ability scales with model size and how performance degrades over longer time spans.
3. Exploring LLM identity and behavioral roots: Because talkie was trained on a fundamentally different data distribution than any modern model, it raises questions about what truly shapes an LLM’s “identity.” Today’s LLMs—regardless of who builds them—all trace back to internet data, either directly or through distillation and synthetic data pipelines. Talkie completely breaks that lineage, giving researchers a tool to examine which behaviors and capabilities are inherent to language modeling itself—and which are artifacts of being trained on today’s web.
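The surprisal measurement in item 2 is reported in bits per byte. As a minimal sketch (not the team's code), assuming you already have per-token natural-log probabilities from some model for a piece of text, the conversion could look like this. The function name and the input format are illustrative assumptions.

```python
import math

def bits_per_byte(token_logprobs, text):
    """Convert per-token log-probabilities (in nats) to bits per UTF-8 byte.

    token_logprobs: log p(token | context) for each token of `text`.
                    Hypothetical input; obtain these from your own model.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    total_nats = -sum(token_logprobs)        # total surprisal in nats
    total_bits = total_nats / math.log(2)    # convert nats to bits
    return total_bits / n_bytes
```

Normalizing by bytes rather than by tokens keeps the number comparable across models with different tokenizers, which matters when comparing a vintage model against a modern one.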
The Training Pipeline: Why This Is So Challenging
Creating a vintage language model isn’t as simple as filtering a modern dataset by date. The talkie team ran into several non-trivial engineering hurdles.
Temporal leakage is perhaps the most critical. If any post-1930 text sneaks into the training data—due to misdated documents or older works with modern editorial additions—the model’s historical integrity breaks down. An earlier 7B version of talkie clearly knew about the Roosevelt presidency and New Deal laws, exposing flaws in the filtering process. To fix this, the team built a document-level n-gram anachronism classifier to screen the data. But they admit it’s still imperfect—the 13B model retains some awareness of World War II and the postwar world.
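The article does not describe how the team's anachronism classifier works internally. Purely as an illustration of the idea of document-level n-gram screening, here is a toy sketch that flags a document when it contains n-grams from a small hand-written list. The seed list, the threshold, and all function names are assumptions; a real classifier would presumably be learned from data and would need to handle false positives.

```python
import re

# Hypothetical seed list of n-grams that point to post-1930 content.
# Illustrative only: some of these phrases can also occur in older text.
ANACHRONISTIC_NGRAMS = {
    ("new", "deal"),
    ("world", "war", "ii"),
    ("nuclear", "reactor"),
    ("president", "truman"),
}

def tokenize(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def anachronism_hits(text, max_n=3):
    """List every 2-gram up to max_n-gram in the text that is on the seed list."""
    tokens = tokenize(text)
    hits = []
    for n in range(2, max_n + 1):
        hits.extend(g for g in ngrams(tokens, n) if g in ANACHRONISTIC_NGRAMS)
    return hits

def is_anachronistic(text, min_hits=1):
    """Flag the whole document if it has at least min_hits suspicious n-grams."""
    return len(anachronism_hits(text)) >= min_hits
```

Scoring at the document level means a single modern foreword or editorial note can disqualify an otherwise period-correct book, which matches the kind of leakage described above.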
Data quality is another major challenge. Since there was no digital publishing in 1930, every token in talkie’s training corpus had to be transcribed from physical sources using optical character recognition (OCR). In controlled experiments, training on text transcribed by standard OCR systems was only 30% as efficient as training on human-transcribed versions of the same texts. Basic regex cleaning raised that figure to 70%, but a significant gap remained. To close it, the team is building a specialized vintage OCR system tailored for historical document layouts.
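The article does not list the cleaning rules the team used. As a rough sketch of what "basic regex cleaning" of OCR output can involve, the following applies a few common fixes: rejoining words split across line breaks, dropping page-number lines, and normalizing whitespace. Every rule here is an assumption chosen for illustration.

```python
import re

def clean_ocr_text(raw):
    """Apply a few simple regex fixes to OCR output (illustrative rules only)."""
    text = raw.replace("\r\n", "\n")
    # Rejoin words hyphenated across a line break: "transcrip-\ntion" -> "transcription".
    # Caveat: this also merges genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain nothing but a short number (likely page numbers).
    text = re.sub(r"(?m)^[ \t]*\d{1,4}[ \t]*$\n?", "", text)
    # Collapse three or more newlines into a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Join single line breaks inside a paragraph; keep blank-line paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Rules like these are cheap to run over a large corpus, but they cannot repair misrecognized characters, which is one plausible reason a gap to human transcription remains.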
Vintage post-training was a third hurdle. The instruction-tuning phase required building an entirely new pipeline from scratch, because using modern instruction-response pairs would inject contemporary expectations into the model’s behavior. Instead, the team generated instruction-response pairs from structured historical texts: etiquette guides, letter-writing manuals, cookbooks, dictionaries, encyclopedias, and collections of poetry and fables. They then applied online direct preference optimization (DPO) with Claude Sonnet 4.6 as the judge, which raised talkie’s average instruction-following score from 2.0 to 3.4 on a five-point scale. A final round of supervised fine-tuning used rejection-sampled, multi-turn synthetic conversations generated between Claude Opus 4.6 and talkie.
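For readers unfamiliar with DPO, here is a minimal sketch of the standard DPO objective for a single preference pair. It is not the team's training code. It assumes you already have summed sequence log-probabilities for the preferred and rejected responses under both the model being trained and a frozen reference model, and the `beta` value is an arbitrary assumption. The "online" part of the team's setup (sampling fresh responses and having a judge rank them) is not shown.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response.
    beta controls how strongly the policy is pushed away from the reference.
    """
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_margin - rejected_margin)
    # Loss is -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls below log(2) once the policy favors the chosen response more strongly than the reference model does, and rises above it in the opposite case.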
Benchmarks: How Does a 1930 Model Compare?
To provide meaningful context, the research team also trained a “modern twin”, an architecturally identical 13B model trained on contemporary internet data (FineWeb), and compared it directly to talkie. As expected, talkie scores lower on standard language model benchmarks. However, when accounting for query anachronism (that is, removing questions that reference concepts that didn’t exist in 1930), the performance gap shrinks by about half. The team reports rough parity on fundamental language comprehension and math tasks, and attributes the remaining difference mainly to OCR errors and differences in subject matter.
Key Takeaways
- Talkie is a 13-billion-parameter open-weight “vintage language model” trained exclusively on 260 billion tokens of English text published before 1931, making it the largest vintage language model to date, with a knowledge cutoff of December 31, 1930.
- Contamination by modern benchmarks is ruled out by design. Because the training corpus is restricted to pre-1931 text, talkie provides a uniquely clean testbed for generalization studies, including whether a model with zero awareness of digital computers can learn to generate Python code purely from in-context examples.
- Creating a vintage language model isn’t just about filtering by date. The research team had to address several challenges: temporal leakage (accidental inclusion of post-1930 material), OCR noise that reduced training efficiency to only 30% of human-transcribed text, and developing a post-training pipeline entirely from pre-1931 sources such as etiquette guides and encyclopedias.
- Two model checkpoints are available under the Apache 2.0 license: talkie-1930-13b-base for raw text generation and talkie-1930-13b-it for dialogue. Running them locally requires a CUDA GPU with at least 28 GB of VRAM.
- Larger versions are in the works. The team aims to release a GPT-scale historical model by summer 2026, supported by a corpus they project can grow to over a trillion tokens, potentially powerful enough to rival the capabilities of the original ChatGPT, frozen in 1930.
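The 28 GB VRAM figure is roughly what you would expect for a 13-billion-parameter model. The back-of-the-envelope calculation below assumes the weights are stored at 2 bytes per parameter (16-bit precision), which the article does not state, and it ignores activations and the attention cache.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed for model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

# Assumption: 16-bit weights (2 bytes each); the article does not give the precision.
weights_gb = weight_memory_gb(13e9, 2)
print(f"{weights_gb:.0f} GB for weights")
```

Under that assumption the weights alone take about 26 GB, which leaves only a small margin within 28 GB for everything else needed at inference time.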



