What happens when you prompt a large language model to stand in for 6,000 American households answering inflation questions? Recent research shows that LLMs can match the average responses of major household surveys with remarkable precision—within a percentage point (Zarifhonarvar, 2026). For instance, the 2020 Survey of Consumer Expectations (SCE) recorded a median one-year-ahead inflation expectation of roughly 3%. When given realistic personas and a knowledge-cutoff instruction, a prompted LLM produced a median estimate of about 3%. This level of accuracy has led some to propose LLMs as a low-cost, high-frequency supplement to established surveys like the SCE, Michigan Survey, and Survey of Professional Forecasters.
In our recent paper, Can LLMs Mimic Household Surveys?, co-authored with Ami Dalloul from the University of Duisburg-Essen, we examine the second moment of the distribution—the part that reveals whether the model captures a single consensus view or a diverse population of opinions. It is precisely here that the apparent success of LLM-based surveys falls apart. The same Llama-3 model that nails the SCE median to within a percentage point confines 95% of its simulated respondents within a two-percentage-point window. By contrast, actual 2020 SCE responses span from roughly minus 25 to plus 27 percent. In short, the average is correct, but the population behind it does not exist. Running a simulation with several thousand LLM personas effectively reduces to a single representative agent.
Figure 1: Dispersion of Real-World and Synthetic Survey Populations
Note: The left panel shows how individual 2020 SCE respondents are dispersed around their mean. The wide spread reflects genuinely diverse beliefs across participants. The middle panel applies the same visualization to synthetic responses from a Llama-3.1-8B-Instruct model prompted with personas matching the SCE demographic distribution. The scatter collapses into a near-point mass. The model captures the mean and discards everything else. The right panel uses the same Llama model after unlearning via gradient ascent (GA). The unlearned model achieves a more realistic dispersion and no longer collapses around the mode.
Mode collapse
We benchmarked five LLMs (Llama-3-8B, Llama-3-70B, Claude-3.7-Sonnet, DeepSeek-V3, GPT-4o) against the SCE, Michigan Survey, and Survey of Professional Forecasters. In the human surveys, 44 to 70% of respondents give answers more than 3 percentage points away from the modal reply; in the LLM samples, that share is essentially zero.
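The tail share quoted above can be computed in a few lines. The following is a minimal sketch, not the paper's code: the function names, the rounding of replies to whole percentage points before taking the mode, and the two toy samples are all assumptions made for illustration.

```python
from collections import Counter


def modal_value(responses, ndigits=0):
    """Most frequent reply after rounding to `ndigits` decimals (assumed convention)."""
    rounded = [round(r, ndigits) for r in responses]
    return Counter(rounded).most_common(1)[0][0]


def tail_share(responses, threshold=3.0, ndigits=0):
    """Percent of replies more than `threshold` points away from the modal reply."""
    mode = modal_value(responses, ndigits)
    far = sum(1 for r in responses if abs(r - mode) > threshold)
    return 100.0 * far / len(responses)


# Toy samples, invented for illustration only (not survey data).
human_like = [3.0, 3.0, 2.0, 10.0, -5.0, 8.0, 3.0, 0.0, 15.0, 4.0]
llm_like = [3.0, 3.0, 3.0, 3.1, 2.9, 3.0, 3.0, 3.2, 3.0, 3.0]

print(tail_share(human_like))  # dispersed sample: large tail share
print(tail_share(llm_like))    # collapsed sample: no tail
```

A collapsed sample returns a tail share of zero no matter how many personas are drawn, which is the pattern reported for the prompted LLMs.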
Standard remedies from the survey-simulation literature fail to fix this problem. Census-derived personas with rich and varied characteristics, zero-shot knowledge-cutoff instructions (“you do not know events after June 2018”), and explicit “do not look up statistics” prompts all produce the same narrow distribution. The likely cause is that the LLMs encountered CPI tables, news coverage of FRBNY survey releases, and academic replications in their training corpora. When asked for the median 2020 inflation expectation, the model is effectively retrieving memorized data. The weight of that training data overwhelms whatever the prompt instructs the model to do.
Unlearning the LLMs
If memorized statistics are the problem, a potential fix is to remove them from the model’s weights rather than ask it to ignore them. We applied two unlearning methods to Llama-3.1-8B-Instruct, an open-source model that allows us to modify its weights:
- Gradient Ascent (GA) maximizes prediction loss on a forget set of CPI series and survey aggregates, with a retain loss on micro-survey reasoning so general capability survives.
- Negative Preference Optimization (NPO) treats the forget set as dispreferred completions and minimizes a bounded preference loss against a reference model.
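The two objectives can be written down compactly. The sketch below works on precomputed numbers (average negative log-likelihoods and per-sequence log-probabilities) instead of a real model, so it shows only the shape of the losses. The function names, `retain_weight`, and `beta=0.1` are assumptions, not values from the paper. The NPO expression is the commonly used form, (2/β)·log(1 + (π_θ/π_ref)^β), averaged over forget sequences.

```python
import math


def ga_objective(forget_nll, retain_nll, retain_weight=1.0):
    """Quantity to minimize under gradient ascent with a retain term.

    Minimizing the negated forget loss maximizes prediction loss on the
    forget set; the retain loss is minimized as usual.
    """
    return -forget_nll + retain_weight * retain_nll


def npo_loss(logp_model, logp_ref, beta=0.1):
    """Bounded NPO loss over forget sequences, given sequence log-probabilities."""
    terms = []
    for lm, lr in zip(logp_model, logp_ref):
        z = beta * (lm - lr)  # beta * log(pi_theta / pi_ref)
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))  # stable log(1 + e^z)
        terms.append(2.0 / beta * softplus)
    return sum(terms) / len(terms)


# The loss falls as the model assigns less probability than the reference
# to a forget sequence, and it is bounded below by zero.
print(npo_loss([-5.0], [-5.0]))   # model equals reference
print(npo_loss([-50.0], [-5.0]))  # model has moved away from the forget data
```

Unlike plain gradient ascent, whose objective is unbounded, the NPO term flattens out once the model's probability on the forget data is well below the reference model's.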
The data we ask the model to forget is the official inflation record itself: monthly CPI series and published mean inflation expectations from the FRBNY SCE and Michigan surveys. The unlearning effect on the response distribution is summarized in Table 1.
Table 1: Tail Accuracy with Different Unlearning Strategies

Note: Unlearning strategies to mitigate mode collapse. Gradient ascent (GA) is a targeted unlearning method where the model is fine-tuned to maximize loss on a dataset of official CPI statistics while minimizing loss on a retain (RT) dataset of micro-survey data. Negative preference optimization (NPO) treats official statistics as negative samples to penalize their generation while treating the retain (RT) samples as positive. Entries report the share of synthetic survey replies whose inflation expectation deviates from the mode (from the mean, in brackets) by zero (exact match), by at most ±1, or by more than ±3 percentage points. Tail Acc. measures closeness to the FRBNY tail dispersion benchmark (share of replies beyond ±3.0 = 44.38%).
The baseline Llama-3 (which includes prompt-based unlearning) produces an exact mode match on 92% of replies and zero replies more than 3pp away. Tail accuracy against the SCE benchmark of 44% is therefore zero. After GA, exact matches drop to 24%, and 43% of replies move beyond ±3pp; tail accuracy reaches 97%. NPO is comparable, with 37% exact matches and 43% of replies beyond ±3pp, for a tail accuracy of 98%. In other words, both unlearning methods appear to recover a more realistic distribution.
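The text describes tail accuracy only as closeness to the 44.38% benchmark, so the exact formula is not given here. One definition consistent with the reported figures is one minus the relative gap to the benchmark, floored at zero. The sketch below uses that assumed definition; the function name is hypothetical and the paper's formula may differ.

```python
def tail_accuracy(model_tail_share, benchmark_tail_share=44.38):
    """Assumed closeness score in percent: 100 * (1 - relative gap), floored at 0.

    `model_tail_share` and `benchmark_tail_share` are the percentages of
    replies more than 3 percentage points away from the mode.
    """
    gap = abs(model_tail_share - benchmark_tail_share) / benchmark_tail_share
    return max(0.0, 100.0 * (1.0 - gap))


print(round(tail_accuracy(0.0)))   # baseline: no replies in the tail
print(round(tail_accuracy(43.0)))  # rounded tail share reported after unlearning
```

Because the shares quoted in the text are rounded, this will not reproduce every reported figure to the last digit.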
Figure 2: Dispersion of LLMs vs. Unlearning Models

Note: The left-hand side plots kernel density estimates (KDEs) of 2020 inflation expectations from the FRBNY SCE and two Llama-3 variants trained with unlearning methods, gradient ascent (GA) and negative preference optimization (NPO). Both unlearning variants cover the range where the FRBNY SCE places probability mass, though they remain more concentrated than the human benchmark and are shifted slightly toward higher means. The right-hand side compares the KDEs of prompted LLM-generated expectations (GPT-4o, Llama-3, etc.) to the FRBNY SCE in 2020. The LLM curves (left axis) are tightly clustered around a narrow region, while the FRBNY SCE curve remains much broader. The LLMs can match central tendency yet fail to reproduce the cross-sectional spread of survey micro-data. Bandwidth = 0.5 for all KDEs.
The kernel densities (Figure 2) show that off-the-shelf models pile probability mass into a thin spike near the mean. The unlearned variants spread mass across the range where the human respondents of the SCE put it.
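The spike-versus-spread contrast can be reproduced with a small Gaussian kernel density estimator. This is a minimal sketch in pure Python. It assumes the bandwidth of 0.5 is an absolute kernel width in percentage points (the note does not say whether it is absolute or a scaling factor), and the two samples are invented for illustration.

```python
import math


def gaussian_kde(sample, grid, bandwidth=0.5):
    """Gaussian kernel density estimate of `sample`, evaluated at each grid point."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
        for x in grid
    ]


# Toy samples, invented for illustration only (not survey data).
collapsed = [3.0, 3.0, 3.1, 2.9, 3.0]
dispersed = [-5.0, 0.0, 3.0, 8.0, 15.0]

step = 0.1
grid = [-20.0 + step * i for i in range(601)]  # -20 to 40 percent

density_collapsed = gaussian_kde(collapsed, grid)
density_dispersed = gaussian_kde(dispersed, grid)

# Same total mass, very different peak heights.
print(max(density_collapsed), max(density_dispersed))
```

With a fixed bandwidth, a collapsed sample stacks all kernels on top of each other and produces one tall spike, while a dispersed sample yields several low bumps across the range.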
Simulating a randomized controlled trial
A wider distribution is necessary but not sufficient for the application that motivated our paper: replicating survey RCTs with synthetic respondents.
Randomized controlled trials (RCTs) are costly. Once data collection wraps up, a researcher can’t revisit the study to explore a new hypothesis or adjust a treatment. Synthetic agents could solve this limitation—provided their responses mirror those of real participants.
To validate this idea, we replicated an actual RCT by Coibion, Gorodnichenko, and Weber (2022). Participants were randomly divided into several groups: a control group received no information, various treatment groups each saw a different piece of economic data (such as the actual past inflation rate, the Fed’s 2% target, etc.), and a placebo group viewed content unrelated to inflation. Every participant first stated their initial inflation expectation, then saw their assigned information, and finally reported an updated expectation. The shift between the initial and updated expectation measured their revision.
A treatment is effective if its revisions clearly differ from the control group’s, and if the direction of change aligns with economic theory: downward revisions after FOMC communication, upward revisions after news of higher gasoline prices. For synthetic agents, the test is whether their revisions separate in the same way human respondents’ did.
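The revision logic described above can be sketched as a difference in mean revisions between a treatment arm and the control arm. This is the simplest possible estimator and is an assumption here; the original study and our paper may use regression-based estimates instead. The record layout, function names, and numbers below are hypothetical.

```python
from statistics import mean


def revisions(records):
    """Revision per respondent: updated expectation minus initial expectation."""
    return [r["posterior"] - r["prior"] for r in records]


def average_treatment_effect(treated, control):
    """Difference in mean revisions between a treatment arm and the control arm."""
    return mean(revisions(treated)) - mean(revisions(control))


# Invented example records (percent), for illustration only.
control = [
    {"prior": 5.0, "posterior": 5.0},
    {"prior": 8.0, "posterior": 7.0},
    {"prior": 3.0, "posterior": 3.0},
]
fed_target = [
    {"prior": 6.0, "posterior": 3.0},
    {"prior": 9.0, "posterior": 5.0},
    {"prior": 4.0, "posterior": 3.0},
]

ate = average_treatment_effect(fed_target, control)
print(ate)  # negative: treated respondents revise downward relative to control
```

If synthetic agents return the same revision in every arm, this quantity is zero for every treatment, which is the failure mode to look for.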
We created 30,000 synthetic personas with Census-based demographics, and measured the average treatment effect across three LLMs, including our unlearned models. The first test focused on priors: the inflation expectations agents reported before seeing any information. Figure 3 shows the mean and standard deviation of these priors across demographic subgroups for the human benchmark and the three LLMs. One unlearning model (Llama-GA) closely matched the human aggregate in both level and spread. However, the other unlearning method (NPO) fell short. This suggests unlearning isn’t a universal fix.
Figure 3: Model Estimates of Perceived Inflation

Note: Each panel displays results by demographic subgroup for the human benchmark (Coibion et al., 2022), the baseline Llama-3, and its two unlearned variants (GA, NPO). The dashed line marks the human “All” value. Left-hand side: Llama-3 and Llama-NPO show almost no variation across demographic groups; Llama-GA matches the human average but fails to capture the within-group ordering (e.g., it predicts the highest mean for “college or more” and “Inc T3,” which contradicts the human pattern). Right-hand side: the unlearned GA model restores much of the variation lost in the base model.
The next test examined how priors changed after the information treatment. In the baseline Llama-3 and Llama-NPO models, revisions were nearly identical across all treatments, showing no treatment effect. Only Llama-GA showed separation between treatments. Within its largest subgroup (80% of the sample), the four monetary-policy treatments (past inflation, Fed target, FOMC forecast, FOMC statement) produced negative and significant revisions—matching the direction and approximate size of human responses in Coibion et al.
Key Takeaways
For researchers and practitioners considering LLMs for survey work, here’s the bottom line:
- LLMs struggle to imitate diverse personas. Simulating surveys often means one agent answering the same question thousands of times, consistently landing near the mean—sometimes to four decimal places.
- Targeted unlearning can recover much of the variation and a meaningful portion of treatment effects seen in human RCTs. That said, results vary by unlearning method.
- The gap between mean accuracy and distributional accuracy is significant enough that any study using synthetic respondents should report both.
Future research should treat distributional accuracy and data leakage as equally important constraints, not afterthoughts. Advances will come from methods that address both what models know and how their outputs are assessed—with more focus on variation, extremes, and belief updating rather than just averages.
References
Coibion, O., Y. Gorodnichenko, and M. Weber (2022). Monetary policy communications and their effects on household inflation expectations. Journal of Political Economy 130(6), 1537–1584.
Dalloul, A. and M. Pfeifer (2026). Can LLMs Mimic Household Surveys?: From Representative Agents to Population Distributions. SSRN preprint.
Zarifhonarvar, A. (2026). Generating inflation expectations with large language models. Journal of Monetary Economics 157, 103859.
Replication Data
Dalloul, A. and M. Pfeifer (2026). Replication Data for: “Can LLMs Mimic Household Surveys?: From Representative Agents to Population Distributions”. Harvard Dataverse, V1.



