It’s no longer a question of whether AI can write code — it’s whether we can truly rely on the code it produces.
In recent years, ChatGPT and similar large language models have become a regular part of the daily routines of students, analysts, researchers, and data scientists. Chances are, most of us have already turned to AI tools at some point — whether to draft a Python function, troubleshoot an error, automate a tedious task, or convert code from one programming language to another.
There’s a big gap, though, between asking ChatGPT to write a quick helper function and asking it to code a sophisticated econometric technique.
Can ChatGPT properly implement a Difference-in-Differences model? Can it code Inverse Probability Treatment Weighting from scratch? Can it replicate a Regression Discontinuity design — and pull off all of these not just in Python, but also in R and Stata?
That’s precisely what drew me to the paper “Can AI write your code? A case study of ChatGPT’s statistical coding capabilities for quantitative research” by Winberg et al. It appeared online on January 22, 2026, in Health Economics Review. The authors put ChatGPT-4.0 Pro to the test, evaluating its ability to generate code for causal inference tasks across Python, R, and Stata, using reference solutions from Scott Cunningham’s Causal Inference: The Mixtape.
Most prior work I’d encountered on this topic focused on comparatively straightforward coding challenges: minor automations, descriptive statistics, data wrangling, basic analysis, or code generation in languages like Python, R, and SAS. This study pushes further. It asks whether ChatGPT can genuinely support quantitative research in more demanding scenarios — where the code isn’t just technical, but also methodologically nuanced.
The authors zero in on three widely used causal inference approaches:
- Difference-in-Differences (Diff-in-Diff);
- Inverse Probability Treatment Weighting (IPTW);
- Regression Discontinuity (RD).
Here, I’ll walk through the study step by step. First, we’ll explore what sets this research apart for quantitative researchers. Next, we’ll review the authors’ methodology. Then, we’ll see how ChatGPT’s output was assessed. Finally, we’ll discuss how the rise of LLMs has reshaped my own workflow.
What Sets This Study Apart?
A lot of earlier research on ChatGPT’s coding skills relied on subjective evaluation. Essentially, researchers reviewed the generated code and made a judgment call on whether it looked right.
That’s a reasonable starting point, but it comes with a clear drawback: the results hinge on the reviewer’s own interpretation.
Winberg et al. take a more rigorous route. They pit ChatGPT’s code against standardized reference implementations and benchmark outputs from Causal Inference: The Mixtape. This lets them evaluate the code not just on how it looks, but on whether it actually reproduces the expected results.
Another key strength is that Stata is included in the evaluation.
This is significant because many empirical researchers — especially in economics, public policy, and health economics — still lean heavily on Stata. Yet conversations around AI coding assistants tend to center on Python and R. By bringing Stata into the picture, the authors test ChatGPT in a language that’s central to applied econometric work but rarely covered in AI coding studies.
The Study’s Methodology
The authors assess ChatGPT-4.0 Pro, the paid tier available at the time of the study. Their aim is to gauge how effectively it performs when tasked with coding causal inference analyses in Python, R, and Stata.
They draw on publicly available data and problem sets from Causal Inference: The Mixtape. This textbook is well-known in applied econometrics circles and offers worked examples in R, Stata, and Python. The reference environments used were R 3.6.0, Stata 18, and Python 3.13.
The authors concentrate on three causal inference methods:
- Difference-in-Differences;
- Inverse Probability Treatment Weighting;
- Regression Discontinuity.
These were selected because they’re staples of empirical research and demand more than just correct syntax. They require careful data preparation, appropriate model specification, and meaningful interpretation of results.
The study follows a three-stage process.
Prompting ChatGPT With Econometric Problem Sets
The first stage involves feeding ChatGPT problem sets and asking it to generate the corresponding econometric code.
Take, for instance, one of the Difference-in-Differences problem sets. The backdrop is the legalization of abortion in five U.S. states before the nationwide shift triggered by the 1973 Roe v. Wade decision. The objective is to estimate whether early abortion legalization had an impact on gonorrhea rates among adolescent females aged 15–19.
Rather than relying on a simple post-treatment dummy, the prompt directs ChatGPT to use year-by-treatment interactions to capture how treatment effects evolve over time.
This kind of prompt goes well beyond a basic regression request. It demands that the model grasp the policy context, define the treatment variable correctly, structure the interaction terms, and produce suitable code.
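To make the year-by-treatment idea concrete, here is a minimal sketch in plain Python. It is not the authors' prompt or The Mixtape's reference code. It relies on one fact: with no covariates, a regression on group, year, and group-by-year interaction terms is saturated, so each interaction coefficient equals the treated-minus-control gap in that year minus the same gap in a base year. The function name `event_study_effects`, the base year, and the data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def event_study_effects(rows, base_year):
    """Year-by-treatment effects relative to base_year.

    rows: iterable of (year, treated, outcome), where treated is 0 or 1.
    Returns {year: effect}, where effect is the treated-minus-control gap
    in that year minus the same gap in base_year.
    """
    cells = defaultdict(list)
    for year, treated, outcome in rows:
        cells[(year, treated)].append(outcome)

    def gap(year):
        # Difference in mean outcomes between treated and control units
        return mean(cells[(year, 1)]) - mean(cells[(year, 0)])

    years = sorted({year for year, _ in cells})
    base_gap = gap(base_year)
    return {y: gap(y) - base_gap for y in years if y != base_year}


# Made-up data: (year, treated, outcome)
rows = [
    (1969, 0, 10.0), (1969, 0, 12.0), (1969, 1, 13.0), (1969, 1, 15.0),
    (1970, 0, 11.0), (1970, 0, 13.0), (1970, 1, 14.0), (1970, 1, 16.0),
    (1971, 0, 12.0), (1971, 0, 14.0), (1971, 1, 13.0), (1971, 1, 15.0),
]
effects = event_study_effects(rows, base_year=1969)
print(effects)
```

In this toy data the gap between groups is unchanged in 1970 and narrows by 2 in 1971. A real analysis would add standard errors and covariates, which is where a regression package becomes necessary.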
The authors construct analogous problem sets for IPTW and RD as well.
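For readers less familiar with IPTW, the sketch below shows the core mechanics in plain Python. It is an illustration under strong simplifying assumptions, not the paper's or The Mixtape's code: there is a single discrete covariate, and the propensity score is just the share of treated units within each covariate value, standing in for a fitted logistic regression. The function name `iptw_ate` and the data are hypothetical.

```python
from collections import defaultdict


def iptw_ate(rows):
    """Estimate an average treatment effect with inverse probability weights.

    rows: list of (x, treated, outcome); x is a discrete covariate and
    treated is 0 or 1. The propensity score is the share of treated units
    within each value of x (an assumption standing in for a fitted model).
    """
    counts = defaultdict(lambda: [0, 0])  # x -> [n_units, n_treated]
    for x, treated, _ in rows:
        counts[x][0] += 1
        counts[x][1] += treated

    num = {0: 0.0, 1: 0.0}  # weighted outcome sums by treatment arm
    den = {0: 0.0, 1: 0.0}  # weight sums by treatment arm
    for x, treated, outcome in rows:
        n, n_treated = counts[x]
        p = n_treated / n  # estimated propensity score for this stratum
        if p in (0.0, 1.0):
            raise ValueError(f"no overlap in stratum x={x!r}")
        weight = 1 / p if treated else 1 / (1 - p)
        num[treated] += weight * outcome
        den[treated] += weight

    # Normalized weighted means, treated minus control
    return num[1] / den[1] - num[0] / den[0]


# Made-up data where outcome = 2 * treated + 5 * x, and units with x = 1
# are more likely to be treated (confounding by x)
rows = (
    [(0, 1, 2.0)] + [(0, 0, 0.0)] * 3
    + [(1, 1, 7.0)] * 3 + [(1, 0, 5.0)]
)
treated_y = [y for _, t, y in rows if t == 1]
control_y = [y for _, t, y in rows if t == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
ate = iptw_ate(rows)
print(naive, ate)
```

Here the naive difference in means is 4.5 because treated units tend to have x = 1, while the weighted estimate comes back to the built-in effect of 2. The overlap check matters: if a stratum is entirely treated or entirely untreated, the weights are undefined.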
Requesting End-to-End Coding Workflows
In the second stage, the authors supply more detailed prompts. These ask ChatGPT to reproduce comprehensive coding tasks from The Mixtape, covering everything from data wrangling to econometric analysis to visualization.
This matters because real-world research rarely boils down to a single model command. A typical researcher needs to load data, clean variables, construct indicators, run models, build tables, create figures, and cross-check findings.
By testing full workflows, the authors examine whether ChatGPT can handle the practical complexity of applied quantitative research.
Executing the Code and Cross-Checking Results
In the third stage, the generated code is run in the relevant environment — Python, R, or Stata.
The authors then compare the results from ChatGPT’s code against the benchmark outputs from The Mixtape.
How the Prompts Were Crafted
One of the most intriguing facets of the study is the way the prompts were designed.
The research team brought in four individuals with strong backgrounds in econometrics. Among them, two had already earned their PhDs, while two were still completing their doctoral research. Each of the first three researchers focused on a single programming language—Python, R, or Stata—while the fourth repeated the entire workflow across all three languages to confirm reliability and check for consistency.
This approach mirrors real-world scenarios where researchers use ChatGPT interactively: posing questions, receiving executable code, running it, identifying issues, and iterating until results are satisfactory.
Yet this method introduces a potential caveat. If every researcher crafts their own prompts independently, variations in outcomes could stem from differences in how questions were phrased rather than actual differences in ChatGPT’s coding capability.
To minimize this risk, the authors established a standardized prompt framework. Through teamwork, they developed instructions that were precise, well-organized, and versatile enough to span multiple tasks. The aim was to equip ChatGPT with sufficient detail to tackle each challenge without tailoring prompts too narrowly to any single scenario.
The success of the generated code depends heavily on how well the prompt is constructed. A vague query often leads to generic or flawed solutions. Conversely, an overly narrow prompt may solve one particular issue but lack broader applicability.
An effective prompt should outline the background context, specify the intended methodology, identify key variables, describe the expected results, and state any underlying assumptions.
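One way to keep those five elements consistent across tasks is a small template. The sketch below is my own illustration, not the authors' framework; the class name `AnalysisPrompt`, its field names, and the variable names in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AnalysisPrompt:
    context: str
    method: str
    variables: dict  # column name -> description
    expected_output: str
    assumptions: list
    language: str = "Python"

    def render(self):
        """Assemble the five elements into a single prompt string."""
        variable_lines = "\n".join(
            f"- {name}: {description}"
            for name, description in self.variables.items()
        )
        assumption_lines = "\n".join(f"- {item}" for item in self.assumptions)
        return (
            f"Context: {self.context}\n"
            f"Method: {self.method}\n"
            f"Variables:\n{variable_lines}\n"
            f"Expected output: {self.expected_output}\n"
            f"Assumptions:\n{assumption_lines}\n"
            f"Write the code in {self.language}."
        )


# Hypothetical variable names, loosely based on the abortion example
prompt = AnalysisPrompt(
    context="Five U.S. states legalized abortion before Roe v. Wade (1973).",
    method="Difference-in-Differences with year-by-treatment interactions",
    variables={
        "early_legal": "1 if the state legalized early, else 0",
        "year": "calendar year",
        "gonorrhea_rate": "rate among females aged 15-19",
    },
    expected_output="a table of interaction coefficients by year",
    assumptions=["parallel trends before legalization"],
    language="Stata",
)
text = prompt.render()
print(text)
```

Because the template is the same for every task and language, differences in the output are easier to attribute to the model rather than to the wording of the request.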
The Five Performance Indicators
The authors assessed ChatGPT’s effectiveness using five core criteria: correctness, performance, runtime faults, modifications needed, and repeatability.
Correctness was determined by comparing outputs from ChatGPT-written code against those from The Mixtape’s established reference solutions.
Judgment followed a straightforward pass/fail rule: if results aligned with the benchmark, they were labeled correct; otherwise, they were marked incorrect.
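A pass/fail check of this kind can be sketched in a few lines. The article does not say how close the numbers had to be, so the relative tolerance below is my assumption, as are the function name `matches_benchmark` and the example estimates.

```python
import math


def matches_benchmark(generated, reference, rel_tol=1e-6):
    """Return True if every reference estimate is reproduced within rel_tol.

    generated, reference: dicts mapping an estimate's name to its value.
    """
    if generated.keys() != reference.keys():
        return False
    return all(
        math.isclose(generated[name], reference[name], rel_tol=rel_tol)
        for name in reference
    )


# Hypothetical numbers, not taken from the paper or The Mixtape
reference = {"treat_x_1971": -2.0, "treat_x_1972": -1.5}
good_run = {"treat_x_1971": -2.0000000001, "treat_x_1972": -1.5}
bad_run = {"treat_x_1971": -2.0, "treat_x_1972": 1.5}  # sign flipped

good_result = matches_benchmark(good_run, reference)
bad_result = matches_benchmark(bad_run, reference)
print(good_result, bad_result)
```

A tolerance is useful because different languages and library versions can differ in the last few decimal places even when the model specification is identical.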
Performance was gauged by counting the lines of code produced by ChatGPT relative to standard reference implementations.
Though not a definitive measure, line count offers a rough proxy for how economical the generated code is.
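As a rough illustration of such a count, the sketch below ignores blank lines and whole-line comments. The article does not describe the exact counting rule, so this rule, the function name `count_code_lines`, and the two snippets are assumptions.

```python
def count_code_lines(source, comment_prefixes=("#",)):
    """Count lines that are neither blank nor whole-line comments."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(tuple(comment_prefixes)):
            count += 1
    return count


# Two hypothetical scripts, held as text only (they are never executed)
generated = """
# load data
import pandas as pd

df = pd.read_csv("data.csv")
print(df.describe())
"""
reference = """
import pandas as pd
print(pd.read_csv("data.csv").describe())
"""

ratio = count_code_lines(generated) / count_code_lines(reference)
print(ratio)
```

For Stata scripts one could pass `comment_prefixes=("*", "//")` instead. A ratio above 1 means the generated script is longer than the reference, which says nothing by itself about whether it is correct.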
The authors also tracked whether ChatGPT-generated scripts triggered runtime faults.
This indicator is highly practical. When code crashes, users must troubleshoot it. For someone without deep knowledge of statistics or coding, debugging alone can become a serious obstacle.
Modifications refers to situations where code runs successfully but still needs tweaks, extra guidance, or manual corrections to yield valid results.
This is especially critical because some mistakes aren’t obvious. Code can execute flawlessly yet produce faulty models, apply incorrect transformations, or display misleading visuals.
Repeatability was tested via replication. A fourth researcher reproduced all tasks using identical prompts across Python, R, and Stata, starting fresh with a new ChatGPT session and no prior chat history.
This step was meant to verify whether ChatGPT followed similar reasoning and structure when given the same input from distinct users.
Consistency matters greatly. If identical prompts lead to wildly different responses between sessions, academics must carefully document and verify every output.
What Did the Study Find?
The overall findings present a measured assessment. The table below summarizes key results.
According to the study, ChatGPT performed better in Python and R than in Stata. Researchers noted that accurate solutions were consistently generated in both R and Python for most tasks, whereas Stata proved less dependable.
This outcome isn’t entirely unexpected.
Python and R dominate data science, statistical analysis, and machine learning. They benefit from vibrant online communities, thorough documentation, and extensive public code repositories. Given that large language models learn from vast datasets, it makes sense they’d excel where training data is richer.
Still, these findings should be interpreted cautiously. The study isn’t a massive benchmark covering countless tasks, but a focused review of specific econometrics exercises. We shouldn’t assume ChatGPT universally performs better in Python or R than in Stata.
A more careful interpretation would be:
For the causal inference problems examined here, ChatGPT showed greater reliability in Python and R compared to Stata.
What the Rise of LLMs Has Changed in My Daily Workflow
What makes this research particularly relevant to me is its direct connection to my own experience, both personally and professionally. In recent years, we’ve moved from ChatGPT Pro 4.0 to ChatGPT Pro 5.5. Here’s how embracing these tools has reshaped my workflow.
Before LLMs, preparing for quantitative projects meant lengthy literature reviews. I had to locate relevant academic papers, decode complex methodologies, weigh different approaches, and adapt them to our datasets.
Now, with ChatGPT, exploring new topics is much quicker. It doesn’t replace deep reading—but it accelerates initial exploration, clarifies terminology, and sharpens methodological questions.
The shift has been even more pronounced at work, especially regarding programming habits.
Previously, we relied on SAS for data retrieval, cleaning, and transformation. SAS remains powerful for handling enterprise-level data processing. However, for advanced modeling and visualization, we turned to R, which suited our statistical experimentation needs.
With LLM adoption, we shifted much of our workload to Python. Beyond its simplicity and popularity, we observed firsthand that ChatGPT often delivers cleaner, more accurate Python examples, with fewer bugs and reusable patterns.
Though we didn’t replicate Winberg et al.’s rigorous methodology, our hands-on experience led us to similar conclusions.
This shift reflects feedback from the modelers on our team and aligns with a broader, long-term strategic direction. In practice, AI has reshaped not only how we write code but also the infrastructure we rely on. We’ve transitioned from working primarily in SAS Studio and RStudio to a workflow centered around VS Code, which integrates far more seamlessly with AI tools like ChatGPT, Claude, and GitHub Copilot.
While this change may seem purely technical on the surface, it actually represents a deeper transformation. AI doesn’t just boost productivity—it actively shapes the programming languages we select, the tools we adopt, and how we structure our workflows.
A concrete illustration of this impact is how we gather external data. Our work often requires publicly available datasets: INSEE statistics, climate records, IPCC reports, NGFS scenarios for climate stress testing, or other data used in ESG risk modeling.
Historically, such tasks could take days or even weeks. We’d need to locate the right source, decipher file formats, download raw data, clean and restructure it, and finally prepare it for modeling. Today, large language models (LLMs) can dramatically accelerate this process.
For instance, I recently needed to extract NAF codes along with their labels from the INSEE website in a ready-to-use format. Previously, this would likely have taken several hours. With a few clear, well-crafted prompts, I quickly generated a script that fetched the data, cleaned the codes by removing dots, and exported an Excel file—ready to use. This isn’t just about saving time; it reduces the gap between having an idea and executing it.
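The cleaning step of such a script might look like the sketch below. It is not the script I actually generated: it skips the download, uses a couple of rows typed by hand (labels should be checked against INSEE's published list), and writes CSV rather than Excel so that it only needs the standard library. The function names are hypothetical.

```python
import csv
import io


def clean_naf_code(code):
    """Strip whitespace and remove the dot from a NAF code."""
    return code.strip().replace(".", "")


def export_codes(rows, out_file):
    """Write (code, label) pairs to CSV with dots removed from the codes."""
    writer = csv.writer(out_file)
    writer.writerow(["naf_code", "label"])
    for code, label in rows:
        writer.writerow([clean_naf_code(code), label.strip()])


# Illustrative rows typed by hand; a real script would read INSEE's file
sample = [
    (" 62.01Z", "Programmation informatique"),
    ("62.02A", "Conseil en systèmes et logiciels informatiques "),
]
buffer = io.StringIO()
export_codes(sample, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example free of side effects. Passing a file opened with `open(path, "w", newline="", encoding="utf-8")` would produce a file on disk instead.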
To me, this is one of the greatest contributions LLMs offer statisticians and quantitative analysts. They excel at data wrangling, statistical modeling, mathematical programming, reporting, and formatting results.
They’ve also become valuable for creating deliverables: structuring documents, refining explanations, polishing tables, annotating figures, and interpreting outcomes. Earlier versions of ChatGPT struggled with these tasks—especially in technical accuracy and citation handling—but recent models perform significantly better, though they still demand careful review.
In my day-to-day work, I treat LLMs less like autonomous experts and more like ultra-efficient research assistants. They can accomplish in hours what once might have taken a research assistant several days: exploring a methodology, drafting code, generating a preliminary chart, rephrasing an interpretation, or automating parts of a report.
However, this speed comes with a critical caveat: human oversight and validation remain non-negotiable.
The danger of AI “hallucinations” is not hypothetical. A recent case underscored this vividly: according to the Financial Times, EY Canada pulled a study promoting its cybersecurity services after it was found to include fabricated data, misattributed citations, and even a reference to a McKinsey report that never existed.
This is precisely why I find the study by Winberg et al. so relevant. It doesn’t merely ask whether ChatGPT can write code—it raises a far more important question: under what circumstances can we trust code generated by AI?
For me, the answer is straightforward. We can use LLMs to move faster—but never to offload the researcher’s responsibility. The analyst must still verify assumptions, validate datasets, test code rigorously, benchmark results, and ensure interpretations are sound.
In short, AI is fundamentally changing how we work—but it doesn’t eliminate the need for deep expertise. If anything, it makes expertise more crucial. The more powerful the tool, the more essential it is to know when to trust it and when to question it.
Looking ahead, AI adoption will continue reshaping our workflows. Some processes will become leaner, others will fade, and entirely new, more sophisticated approaches will emerge. To stay competitive, we must keep learning, remain adaptable, and be willing to embed these tools thoughtfully into our professional routines.
At the same time, AI is transforming how knowledge is created and shared. Because these tools dramatically increase productivity, a paper that once demanded a month to write might now be completed in a week. This has clear benefits: it lowers barriers to publishing, enables more people to contribute ideas, and speeds up the dissemination of knowledge.
But it also introduces a new challenge. If everyone can produce content faster, the digital landscape will become even more saturated. Individual articles may struggle for visibility. Some writers may feel disheartened—especially if their thoughtful work goes unnoticed despite the effort invested.
I believe this will give rise to a new kind of divide: between those who wield AI skillfully and those who don’t, but also between those churning out content for volume’s sake and those driven by genuine passion for their subject.
In the long term, I’m convinced the people who endure will be the intellectually curious—the ones who seek to learn, think deeply, and share insights with sincerity. AI may accelerate writing, but it will never replace curiosity, discipline, or the desire to contribute something meaningful.
References
Winberg, D., Tsai, E., Tang, T., Xuan, D., Marchi, N., & Shi, L. (2026). Can AI write your code? A case study of ChatGPT’s statistical coding capabilities for quantitative research. Health Economics Review.



