Most production ML models don't decay gracefully — they fail in sudden, unpredictable shocks. When we fit an exponential "forgetting curve" to 555,000 production-like fraud transactions, it returned R² = −0.31, meaning it performed worse than predicting the mean.
Before you set (or trust) any retraining schedule, run this three-line diagnostic on your current weekly metrics:
report = tracker.report()
print(report.forgetting_regime)  # "smooth" or "episodic"
print(report.fit_r_squared)      # < 0.4 → abandon schedule assumptions
- R² ≥ 0.4 → Smooth regime → scheduled retraining works
- R² < 0.4 → Episodic regime → use shock detection, not calendars
If your R² is below 0.4, your model isn't "decaying" — and everything derived from a half-life estimate is likely misleading.
It Started With One Week
It was Week 7.
Recall dropped from 0.9375 to 0.7500 in seven days flat. No alert fired. The aggregate monthly metric moved just a few points — well within tolerance. The dashboard showed green.
That single week erased three weeks of model improvement. Dozens of fraud cases that the model used to catch went straight through undetected. And the standard exponential decay model — the mathematical backbone of every retraining schedule I had ever built — didn't just fail to predict it.
It predicted the opposite.
R² = −0.31. Worse than a flat line.
That number broke something in how I think about MLOps. Not dramatically. Quietly. The kind of break that makes you go back and re-examine an assumption you've been carrying for years without ever questioning it.
This article is about that assumption, why it's wrong for an entire class of production ML systems, and what to do instead — backed by real numbers on a public dataset you can reproduce in a day.
Full code:
The Assumption Nobody Questions
The entire retraining-schedule industry is built on a single idea borrowed from a 19th-century German psychologist.
In 1885, Hermann Ebbinghaus conducted a series of self-experiments on human memory — memorising lists of nonsense syllables, measuring his own retention at fixed intervals, and plotting the results over time [1]. What he documented was a clean exponential relationship:
R(t) = R₀ · e^(−λt)

Memory fades smoothly. Predictably. At a rate proportional to how much memory remains. The curve became one of the most replicated findings in cognitive psychology and remains a foundational reference in memory research to this day.
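As a quick numerical illustration of what this formula predicts (values here are illustrative, not Ebbinghaus's data), the retention curve can be evaluated directly:

```python
import math

def retention(t_days: float, r0: float, lam: float) -> float:
    """Exponential forgetting curve: R(t) = R0 * exp(-lambda * t)."""
    return r0 * math.exp(-lam * t_days)

# Illustrative: full retention at t=0, with a 30-day half-life
lam = math.log(2) / 30
print(round(retention(30, 1.0, lam), 3))  # one half-life  -> 0.5
print(round(retention(60, 1.0, lam), 3))  # two half-lives -> 0.25
```

The key property is that the rate of loss is proportional to how much remains — exactly the property that the episodic regime discussed later violates.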
A century later, the machine learning community borrowed it wholesale. The logic felt sound: a production model is exposed over time to new patterns it was not trained on, so its performance degrades gradually and consistently. Set a retraining cadence based on the decay rate. Estimate a half-life. Schedule accordingly.
Every major MLOps platform, every "retrain every 30 days" rule of thumb, every automated decay calculator is downstream of this assumption.
The problem is that nobody verified it against production data.
So I did.
The Experiment
I used the Kaggle Credit Card Fraud Detection dataset created by Kartik Shenoy [2] — a synthetic dataset of 1.85 million transactions generated using the Sparkov Data Generation tool [3], covering the period January 2019 to December 2020. The test split contains 555,719 transactions spanning June to December 2020, with 2,145 confirmed fraud cases (0.39% prevalence).
The simulation was designed to mirror a realistic production deployment:
- Model: LightGBM [4] trained once on historical data, never retrained during the test period
- Primary metric: Recall — in fraud detection, a missed fraud costs orders of magnitude more than a false alarm, making recall the operationally correct objective [5]
- Evaluation: Weekly rolling windows on the hold-out test set, each window containing between 15,000 and 32,000 transactions
- Quality filters: Windows with fewer than 30 confirmed fraud cases were excluded — below that threshold, weekly recall estimates are statistically unreliable given the extreme class imbalance
The baseline was established using the mean of the top-3 recall values across the first six qualifying weeks — a method designed to ignore early warm-up noise while tracking near-peak performance.
26 Weeks of Production Performance
Here is what the full simulation produced across 26 qualifying weekly windows:
Week 1 [2020-06-21] n=19,982 fraud= 68 R=0.7647
Week 2 [2020-06-28] n=20,025 fraud= 100 R=0.8300
Week 3 [2020-07-05] n=20,182 fraud= 83 R=0.7831
Week 4 [2020-07-12] n=19,777 fraud= 52 R=0.8462
Week 5 [2020-07-19] n=19,898 fraud= 99 R=0.8586
Week 6 [2020-07-26] n=19,733 fraud= 64 R=0.9375 ← peak
Week 7 [2020-08-02] n=20,023 fraud= 152 R=0.7500 ← worst shock (−0.1875)
Week 8 [2020-08-09] n=19,637 fraud= 82 R=0.7439
Week 9 [2020-08-16] n=19,722 fraud= 59 R=0.7966
Week 10 [2020-08-23] n=19,605 fraud= 102 R=0.8922
Week 11 [2020-08-30] n=18,081 fraud= 84 R=0.8690
Week 12 [2020-09-06] n=16,180 fraud= 67 R=0.7910
Week 13 [2020-09-13] n=16,087 fraud= 63 R=0.8413
Week 14 [2020-09-20] n=15,893 fraud= 90 R=0.7444
Week 15 [2020-09-27] n=16,009 fraud= 81 R=0.8272
Week 16 [2020-10-04] n=15,922 fraud= 121 R=0.8264
Week 17 [2020-10-11] n=15,953 fraud= 111 R=0.8559
Week 18 [2020-10-18] n=15,883 fraud= 53 R=0.9245 ← recovery
Week 19 [2020-10-25] n=15,988 fraud= 73 R=0.8630
Week 20 [2020-11-01] n=15,921 fraud= 70 R=0.7429 ← second shock
Week 21 [2020-11-08] n=16,098 fraud= 59 R=0.9322 ← recovery
Week 22 [2020-11-15] n=15,835 fraud= 63 R=0.9206
Week 23 [2020-11-22] n=15,610 fraud= 91 R=0.9121
Week 24 [2020-11-29] n=30,246 fraud= 57 R=0.8596 ← volume doubles
Week 25 [2020-12-06] n=31,946 fraud= 114 R=0.7895
Week 26 [2020-12-13] n=31,789 fraud= 67 R=0.8507

Two windows were excluded: the week of December 20 (only 20 fraud cases) and December 27 (zero fraud cases recorded — a data artefact consistent with the holiday period).
What −0.31 Actually Means
R² — the coefficient of determination — measures how much of the variance in the observed data is explained by the fitted model [6].
- R² = 1.0: Perfect fit. The model explains all observed variance.
- R² = 0.0: The model does no better than predicting the mean of the data for every point.
- R² < 0.0: The model is actively harmful — it introduces more prediction error than a flat mean line would.
When the exponential decay model returned R² = −0.3091 on this dataset, it was not fitting poorly. It was fitting backwards. The model predicts a gentle slope declining from a stable peak. The data shows repeated sudden drops and recoveries with no consistent directional trend.
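A minimal sketch of how R² goes negative: when a model's squared residuals exceed the variance around the mean, the ratio SS_res/SS_tot exceeds 1. The numbers below are illustrative, not the article's actual fit:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Oscillating observations vs. a model predicting a gentle downward slope
obs       = [0.94, 0.75, 0.80, 0.89, 0.74, 0.92]  # shocks and recoveries
decay_fit = [0.90, 0.89, 0.88, 0.87, 0.86, 0.85]  # smooth decay prediction
print(r_squared(obs, decay_fit) < 0)  # True — worse than predicting the mean
```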
This is not a decay curve. It is a seismograph.
Two Regimes of Model Forgetting
After observing this pattern, I formalised a classification framework based on the R² of the exponential fit. Two regimes emerge cleanly:

The smooth regime is the world Ebbinghaus described. Feature distributions shift gradually — demographic changes, slow economic cycles, seasonal behaviour patterns that evolve over months. The exponential model fits the observed data well. The half-life estimate is actionable. A scheduled retraining cadence is the correct operational response.
The episodic regime is what fraud detection, content recommendation, supply chain forecasting, and any domain with sudden external discontinuities actually look like in production. Performance doesn't decay — it switches. A new fraud pattern emerges overnight. A platform policy change flips user behaviour. A competitor exits the market and their customers arrive with different characteristics. A regulatory change alters the transaction mix.
These are not points on a decay curve. They are discontinuities. And the R² diagnostic identifies which regime you are in before you commit to an operations strategy built on the wrong assumption.
This pattern extends beyond fraud detection — similar episodic behaviour appears in recommendation systems, demand forecasting, and user behaviour modelling.
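The regime classification itself reduces to a small helper. A sketch (the function name is mine; the 0.4 cutoff is the one proposed in this article):

```python
def classify_regime(fit_r_squared: float, cutoff: float = 0.4) -> str:
    """Classify the forgetting regime from the R² of the exponential fit."""
    return "smooth" if fit_r_squared >= cutoff else "episodic"

print(classify_regime(0.85))   # smooth — half-life estimate is actionable
print(classify_regime(-0.31))  # episodic — use shock detection instead
```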

Why Fraud Detection Is Always Episodic
The Week 7 collapse was not random noise. Let us look at what the data actually shows.
Week 6 (July 26): 64 fraud cases. Recall = 0.9375. The model is near peak performance.
Week 7 (August 2): 152 fraud cases — 2.4 times more fraud than the previous week — and recall collapses to 0.7500. The model missed 38 frauds it could have detected seven days earlier.
A 137% increase in fraud volume in a single week does not signal a gradual distribution shift. It signals a regime change — a new fraud ring, a newly exploited vulnerability, or an organised campaign that the model had never encountered in its training data. The model's learned patterns became suddenly inadequate, not gradually inadequate.
Then consider Week 24 (November 29). Transaction volume nearly doubles — from roughly 16,000 transactions per week to 30,246 — as the Thanksgiving and Black Friday period begins. Simultaneously, the fraud count drops to 57, giving a fraud rate of 0.19%, the lowest in the entire test period. The model encounters a volume-to-fraud ratio it has never seen. Recall holds at 0.860, but only because the absolute fraud count is low. Precision simultaneously collapses, flooding any downstream review queue with false positives.
Neither of these events is a point on a decay curve. Neither can be predicted by a retraining schedule. Both can be caught immediately by a single-week shock detector.
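The Week 6 → Week 7 arithmetic can be verified directly from the table above:

```python
week6_fraud, week7_fraud = 64, 152
week7_recall = 0.7500

volume_ratio = week7_fraud / week6_fraud                     # 2.375, i.e. ~2.4x
volume_increase = (week7_fraud - week6_fraud) / week6_fraud  # 1.375, i.e. ~137%
missed = round(week7_fraud * (1 - week7_recall))             # frauds not caught

print(volume_ratio, missed)  # 2.375 38
```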
The Diagnostic Framework
The classification is a three-step process that can be applied to any existing performance log.
Step 1: Fit the forgetting curve and compute R²
from model_forgetting_curve import ModelForgettingTracker

tracker = ModelForgettingTracker(
    metric_name="Recall",
    baseline_method="top3",    # mean of top-3 early weeks
    baseline_window=6,         # weeks used to establish the baseline
    retrain_threshold=0.07,    # alert threshold: 7% drop from baseline
)

# log your existing weekly metrics
for week_recall in your_weekly_metrics:
    tracker.log(week_recall)

report = tracker.report()
print(f"Regime : {report.forgetting_regime}")
print(f"R²     : {report.fit_r_squared:.3f}")

Step 2: Branch on regime
if report.fit_r_squared >= 0.4:
    # SMOOTH — exponential model is valid
    print(f"Schedule retrain in {report.predicted_days_to_threshold:.0f} days")
    print(f"Half-life: {report.half_life_days:.1f} days")
else:
    # EPISODIC — exponential model is invalid; use shock detection
    print("Deploy shock detection. Abandon the calendar schedule.")

Step 3: If episodic, replace the schedule with these three mechanisms
import pandas as pd
import numpy as np

recall_series = pd.Series(your_weekly_metrics)
fraud_counts = pd.Series(your_weekly_fraud_counts)

# Mechanism 1 — single-week shock detector
rolling_mean = recall_series.rolling(window=4).mean()
shock_flags = recall_series < (rolling_mean * 0.92)

# Mechanism 2 — volume-weighted recall (more reliable than raw recall)
weighted_recall = np.average(recall_series, weights=fraud_counts)

# Mechanism 3 — two-consecutive-week trigger (reduces false retrain alerts)
breach = recall_series < (recall_series.mean() * (1 - 0.07))
retrain_trigger = breach & breach.shift(1).fillna(False)

print(f"Shock weeks detected      : {shock_flags.sum()}")
print(f"Volume-weighted recall    : {weighted_recall:.4f}")
print(f"Retrain trigger activated : {retrain_trigger.any()}")

The threshold of 0.92 in the shock detector (alert when recall drops more than 8% below the 4-week rolling mean) and the retrain threshold of 0.07 relative to the long-run baseline are starting points, not fixed rules. Calibrate both against your domain's cost asymmetry — the ratio of missed-fraud cost to false-alarm cost — and your labelling latency.

The Full Diagnostic Report
============================================================
FORGETTING CURVE REPORT
============================================================
Baseline Recall          : 0.8807 [top3]
Current Recall           : 0.8507
Retention ratio          : 96.6%
Decay rate lambda        : 0.000329
Half-life                : 2107.7 days ← statistically meaningless
Forgetting speed         : STABLE
Forgetting regime        : EPISODIC
Curve fit R-squared      : -0.3091 ← the operative number
Snapshots logged         : 26
Retrain recommended NOW  : False
Days until retrain alert : 45.7 ← unreliable in episodic regime
Recommended retrain date : 2026-05-20 ← disregard in episodic regime
Worst single-week drop   : Week 7 (−0.1875)
============================================================

The tension visible in this report is intentional and important. The system simultaneously reports Forgetting speed: STABLE (decay rate λ = 0.000329 implies a half-life of 2,107 days) and Forgetting regime: EPISODIC (R² = −0.31). Both are correct.
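The half-life figure is a direct transform of the fitted decay rate (t½ = ln 2 / λ), which is why it inherits the fit's invalidity. The small gap versus the report's 2107.7 comes from λ being rounded in the printout:

```python
import math

lam = 0.000329             # decay rate as printed in the report
half_life = math.log(2) / lam
print(round(half_life, 1))  # ~2106.8 days
```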
The average performance across 26 weeks is stable — current recall of 0.8507 against a baseline of 0.8807 gives a retention ratio of 96.6%, comfortably above the 93% retrain threshold. An aggregate monthly dashboard would show a model operating well within acceptable bounds.
The week-to-week behaviour is violently unstable. The worst single shock dropped recall by 18.75 points in seven days. Three separate weeks dropped below 75% recall. These events are completely invisible in aggregate metrics and completely unpredictable by a decay model with a 2,107-day half-life.
This is the core failure mode of calendar-based retraining: the granularity at which monitoring happens determines what is visible. Episodic shocks only appear at weekly or sub-weekly resolution, with per-window fraud case counts as quality weights.
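The granularity point can be shown in a few lines. With illustrative numbers, a single shock week is obvious at weekly resolution and invisible in the monthly mean:

```python
import numpy as np

weekly = np.array([0.93, 0.94, 0.75, 0.92])  # one shock week in the month
print(round(float(weekly.mean()), 3))  # 0.885 -- monthly dashboard shows green
print(float(weekly.min()))             # 0.75  -- the week that actually matters
```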

On the Choice of Baseline Method
One calibration decision materially affects the diagnostic: how the baseline recall is computed from the first N qualifying weeks.
Three methods are available, each with different sensitivity characteristics:
"mean" — arithmetic mean of the first N weeks. Appropriate when early-week performance is consistent. Sensitive to warm-up noise when the model has not yet encountered the full test distribution, which is common in fraud detection where label arrival is delayed.
"max" — peak performance in the first N weeks. The most conservative option: any subsequent drop below the historical peak is immediately visible. Risk: a single anomalously good week permanently inflates the baseline, producing false retrain alerts for weeks that are performing normally.
"top3" — mean of the top three values in the first N weeks. The method used in this simulation. It filters warm-up noise while preserving proximity to true peak performance. Recommended for imbalanced classification problems with delayed labelling.
The choice of baseline_window — how many weeks are included in baseline estimation — matters equally. Six weeks is the minimum for statistical stability at typical fraud prevalence rates. Fewer than six weeks risks a baseline dominated by early distributional artefacts.
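Applying the three baseline methods to the first six weeks of recall values from the table above shows the trade-off concretely:

```python
weekly_recall = [0.7647, 0.8300, 0.7831, 0.8462, 0.8586, 0.9375]  # weeks 1-6

baseline_mean = sum(weekly_recall) / len(weekly_recall)
baseline_max = max(weekly_recall)
baseline_top3 = sum(sorted(weekly_recall, reverse=True)[:3]) / 3

print(round(baseline_mean, 4))  # 0.8367 -- pulled down by warm-up weeks
print(baseline_max)             # 0.9375 -- inflated by the single peak week
print(round(baseline_top3, 4))  # 0.8808 -- near-peak, noise-filtered
```

The top-3 value lands next to the 0.8807 baseline in the diagnostic report; the tiny difference comes from the report using unrounded weekly recalls.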
What This Means for MLOps Practice
The practical implications break cleanly along regime lines.
In the smooth regime, calendar-based retraining is valid — but the cadence should be derived from the empirical decay rate, not from convention. A model with a half-life of 180 days should be retrained every 120 to 150 days. A model with a half-life of 30 days needs weekly retraining. The exact schedule should be calibrated to the point where retention falls to the threshold — not picked because monthly feels reasonable.
In the episodic regime, calendar-based retraining is operationally wasteful. A model that experiences sudden shocks but recovers will trigger scheduled retrains during stable recovery periods, wasting compute and labelling budget, while the actual shock events — the weeks that matter — occur between scheduled dates and go unaddressed until the next calendar trigger.
The replacement is not more frequent scheduled retraining. It is event-driven retraining triggered by the shock detection mechanisms described above: a sudden drop below the rolling mean, sustained across two consecutive windows, with confirmation that the drop is not a data artefact (volume check, fraud rate floor, labelling delay indicator).
This is the distinction that the R² diagnostic makes actionable: it tells you which toolbox to open.
Limitations
This analysis has several boundaries that should be stated explicitly.
The dataset is synthetic. The Kaggle fraud dataset used here was generated with the Sparkov simulation tool [3] and does not represent real cardholder or merchant data. The fraud patterns reflect the simulation's generative model, not actual fraud ring behaviour. The episodic shocks observed may differ in character from those encountered in live production systems, where shocks often involve novel attack vectors with no prior representation in training data.
A single domain. The two-regime classification framework is proposed based on analysis of fraud detection data and informal observation of other domains. A systematic study across multiple production ML systems, including healthcare risk models, content recommendation engines, and demand forecasting systems, would be required to validate the R² cutoff of 0.4 as a robust regime boundary.
Label availability assumption. The simulation assumes weekly recall is computable from ground-truth labels available within the same week. In many fraud systems, confirmed fraud labels arrive with delays of days to weeks as investigations complete. The shock detection mechanisms described here require adaptation for delayed-labelling environments — specifically, the rolling window should be built from label-available transactions only, and the shock threshold should be widened to account for label-arrival variance.
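One possible shape for that adaptation, sketched below. The column names and the widening rule are my own illustration, not the library's API: recall is computed only over label-available transactions, and the shock threshold is widened in proportion to missing label coverage.

```python
import pandas as pd

# Hypothetical weekly log: recall computed over label-available transactions only
log = pd.DataFrame({
    "recall":              [0.91, 0.90, 0.74, 0.92, 0.89],
    "labels_arrived_frac": [0.98, 0.95, 0.60, 0.97, 0.96],
})

# Widen the 8% shock threshold where label coverage is incomplete
base_mult, widen = 0.92, 0.05
threshold_mult = base_mult - widen * (1 - log["labels_arrived_frac"])

rolling = log["recall"].rolling(3, min_periods=1).mean()
shock = log["recall"] < rolling * threshold_mult
print(shock.tolist())  # [False, False, True, False, False]
```

Here week 3 still flags despite the widened threshold, because the drop is large relative to the rolling baseline.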
The retrain threshold of 7%. This is a starting point, not a universal recommendation. The operationally correct threshold is a function of the cost ratio between missed fraud and false alarms in a specific deployment, which varies significantly across merchant categories, transaction values, and review team capacity.
Reproducing This Analysis
The complete implementation — ModelForgettingTracker, the fraud simulation, all four diagnostic charts, and the live monitoring dashboard — is available at .
Requirements:
pip install pandas numpy scikit-learn matplotlib scipy lightgbm

Dataset: Download fraudTrain.csv and fraudTest.csv from the Kaggle dataset [2] and place them in the same directory as fraud_forgetting_demo.py.
Run:
python fraud_forgetting_demo.py

To apply the tracker to an existing performance log without the Kaggle dependency:
from model_forgetting_curve import load_from_dataframe

tracker = load_from_dataframe(
    df,
    metric_col="weekly_recall",
    metric_name="Recall",
    baseline_method="top3",
    retrain_threshold=0.07,
)

report = tracker.report()
print(f"Regime: {report.forgetting_regime}")
print(f"R²: {report.fit_r_squared:.3f}")

figs = tracker.plot(save_dir="./charts", dark_mode=True)

Conclusion
The Ebbinghaus forgetting curve is a foundational result in cognitive psychology. As an assumption about production ML system behaviour, it is unverified for an entire class of domains where performance is driven by external events rather than gradual distributional drift.
The R² diagnostic presented here is a one-time, zero-infrastructure check that classifies a system's forgetting regime from existing weekly performance logs. If R² ≥ 0.4, the exponential model is valid and a retraining schedule is the correct tool. If R² < 0.4, the model is in the episodic regime, the half-life is meaningless, and the retraining schedule should be replaced with event-driven shock detection.
On 555,000 synthetic production-like transactions spanning six months of simulated deployment, the fraud detection model returned R² = −0.31. The exponential decay model performed worse than predicting the mean. The worst shock dropped recall by 18.75 points in seven days with no aggregate-level signal.
The conclusion is precise: scheduled retraining is a symptom of not knowing which regime you are in. Run the diagnostic first. Then decide whether a schedule makes sense at all.
Disclosure
The author has no financial relationship with Kaggle, the dataset creator, or any of the software libraries referenced in this article. All tools used — LightGBM, scikit-learn, SciPy, pandas, NumPy, and Matplotlib — are open-source projects distributed under their respective licences. The dataset used for this analysis is a publicly available synthetic dataset distributed under the Database Contents Licence (DbCL) on Kaggle [2]. No real cardholder, merchant, or financial institution data was used. The ModelForgettingTracker implementation described and linked in this article is original work by the author, released under the MIT licence.
References
[1] Ebbinghaus, H. (1885). Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. Duncker & Humblot, Leipzig. English translation: Ebbinghaus, H. (1913). Memory: A Contribution to Experimental Psychology (H. A. Ruger & C. E. Bussenius, Trans.). Teachers College, Columbia University.
[2] Shenoy, K. (2020). Credit Card Transactions Fraud Detection Dataset [Data set]. Kaggle. Retrieved from
Distributed under the Database Contents Licence (DbCL) v1.0.
[3] Mullen, B. (2019). Sparkov Data Generation [Software]. GitHub.
[4] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 3146–3154.
[5] Dal Pozzolo, A., Caelen, O., Johnson, R. A., & Bontempi, G. (2015). Calibrating probability with undersampling for unbalanced classification. 2015 IEEE Symposium Series on Computational Intelligence, 159–166.
[6] Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78(3), 691–692.
[7] Tsymbal, A. (2004). The problem of concept drift: Definitions and related work (Technical Report TCD-CS-2004-15). Department of Computer Science, Trinity College Dublin.
[8] Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 44:1–44:37.
[9] Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., & Zhang, G. (2019). Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12), 2346–2363.
[10] Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … SciPy 1.0 Contributors. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272.
[11] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[12] Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95.
[12] Hunter, J. D. (2007). Matplotlib: A 2D graphics surroundings. Computing in Science & Engineering, 9(3), 90–95.



