If your model states “there is a 40% chance this customer will leave” and your code runs predict(X) >= 0.5, you have just set a pricing rule: you have decided that the cost of offering a retention incentive to someone who would have stayed is exactly equal to the cost of losing a customer who was about to leave. Based on the IBM Telco dataset (perhaps the most frequently reused churn dataset on Kaggle and GitHub), that rule is off by a factor of 13.
I compiled a set of 36 publicly available IBM Telco churn studies (Kaggle notebooks, GitHub repos, blog posts, peer-reviewed papers), and the reporting trend is clear: roughly nine out of ten report classification accuracy or F1, just over 14% include a profit curve, and none apply survival analysis to estimate lifetime value.
The outcome is a body of work in which the same dataset has been reworked hundreds of times, yet every default-threshold model sacrifices value: approximately $86 per customer in preventable loss on the standard 20% test split. Extrapolated to a 100,000-subscriber base with a comparable churn pattern, that translates to $8.6 million in recoverable expense. The IBM Telco churn rate (26.6% per year) is notably elevated; a more typical B2C SaaS base with 5–8% annual churn would see the per-customer figure decline by about 3–4×. What carries over to any cost-sensitive situation is therefore not the headline dollar figure but the imbalance: missing a churner is 13× more costly than over-retaining a loyalist.
This article covers three topics, in sequence: first, what the IBM Telco literature reports and what it omits; second, how to calculate the dollar cost of a misclassification using public 2026 B2C SaaS benchmarks and Kaplan-Meier survival analysis, without hand-waving CAC; third, why the textbook Bayes-optimal threshold formula falls short of a brute-force sweep when the model is trained on SMOTE-balanced data, and how to address it.
Every figure in this article can be reproduced from the scripts described in the closing note.
1. The 36-article gap
The IBM Telco Customer Churn dataset is compact (7,032 cleaned rows), well-structured, labeled, and has served as the go-to introductory churn dataset on Kaggle for close to a decade.
To understand what the public corpus actually evaluates, I cataloged 36 analyses from Kaggle, GitHub, and major data-science blogs, rating each on ten reporting dimensions ranging from F1 score to a CAC-and-LTV-grounded profit curve.
The trend that surfaced is shown in Figure 2.

Three takeaways stand out:
- Saturated: F1, accuracy, AUC, confusion-matrix screenshots, and SMOTE-vs-no-SMOTE comparisons appear in 80–90% of the corpus, with hyperparameter tuning via Optuna or grid search a near-universal fixture.
- Uncommon: a profit curve (total dollar cost of misclassifications as a function of decision threshold) shows up in fewer than 15% of the analyses reviewed, and when it does, the FN/FP cost figures are typically borrowed from a textbook example without grounding them in real CAC or LTV.
- Absent: none of the 36 analyses cataloged computes customer lifetime value through survival analysis on tenure; most either skip LTV entirely or rely on the steady-state Skok formula LTV = ARPU / monthly_churn_rate, which presumes a uniform customer base. That is a bold assumption for a dataset where contract type, payment method, and tenure all meaningfully affect retention.
Omitting survival analysis is consequential because the threshold choice depends on LTV: if you misestimate LTV by 2× you misestimate the cost of a missed churner by 2×, and the cost-optimal threshold shifts accordingly.
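A quick sketch of that sensitivity, using the placeholder costs this article introduces in Section 2 and the standard cost-sensitive threshold C_FP / (C_FP + C_FN), which assumes calibrated probabilities. The helper name `cost_optimal_threshold` and the choice to scale only the lost-revenue part of the false-negative cost (leaving CAC fixed) are assumptions made for illustration.

```python
def cost_optimal_threshold(fp_cost: float, fn_cost: float) -> float:
    """Threshold that minimises expected cost when probabilities are calibrated."""
    return fp_cost / (fp_cost + fn_cost)


CAC = 150.0                 # placeholder acquisition cost from Section 2
FP_COST = 100.0             # placeholder campaign cost from Section 2
LOST_REVENUE = 64.80 * 18   # ARPU x remaining tenure: the LTV-dependent part

# Hypothetical scenario: the LTV estimate is off by 0.5x, 1x, or 2x
thresholds_by_multiplier = {}
for ltv_multiplier in (0.5, 1.0, 2.0):
    fn_cost = CAC + LOST_REVENUE * ltv_multiplier
    thresholds_by_multiplier[ltv_multiplier] = cost_optimal_threshold(FP_COST, fn_cost)
    print(f"LTV x{ltv_multiplier}: FN cost ${fn_cost:,.2f}, "
          f"threshold {thresholds_by_multiplier[ltv_multiplier]:.4f}")
```

Under these assumptions, doubling the LTV estimate moves the threshold from roughly 0.07 to roughly 0.04, and halving it moves the threshold to roughly 0.12.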
The next two sections construct the missing component, then feed it back into the threshold problem.
2. The cost of an error, in dollars
Three figures determine the dollar cost of every prediction (ARPU, gross margin, and CAC); two are drawn directly from the dataset, one from public 2026 industry benchmarks.
import pandas as pd

df = pd.read_csv("telco.csv")
# TotalCharges is read as text; blank strings become NaN and those rows are dropped
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna().reset_index(drop=True)

arpu = df["MonthlyCharges"].mean()           # $64.80
mean_tenure = df["tenure"].mean()            # 32.42 months
churn_rate = (df["Churn"] == "Yes").mean()   # 0.2658, i.e. 26.58%
realised_ltv_churned = df.loc[df["Churn"] == "Yes", "TotalCharges"].mean()  # $1,531.80

ARPU and tenure come straight from the dataset: average monthly charge is $64.80, average observed tenure is 32.4 months, and the realised LTV of churners is $1,531.80 (simply the average TotalCharges across customers who already left). Holding ARPU constant, three common LTV approaches yield very different upper bounds:

None of these three is the correct answer, and the first, ARPU × mean_tenure, is fundamentally flawed even though it is the approach most tutorials default to. ARPU is not an attribute of a customer; it is the combined result of every feature that also influences churn: contract type, payment method, product bundle, household composition. The revenue departs with the customer, so ARPU and tenure are not independent measures, and the textbook breakdown LTV ≈ E[ARPU] × E[lifetime] only holds when Cov(ARPU, lifetime) = 0. In any churn dataset worth modeling that covariance is non-zero (if it were zero, the model would have nothing to predict), and a customer with $80/month in MonthlyCharges and 8 months of tenure is not a “high-revenue customer”; their $80 and their 8 months are two reflections of the same underlying risk profile.
Layer on the fact that ARPU varies within a single customer’s lifetime — a promo-onboarding cohort paying $30/month for the first three months and $80/month thereafter, a long-tenured loyalty customer grandfathered into a $50/month rate while new customers pay $90/month for the same product — and multiplying the mean of one distribution by the mean of another describes a customer who exists nowhere in the data.
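The covariance argument can be checked on a handful of made-up customers. The sketch below is only an illustration: the five charge and lifetime values are invented, and the negative sign of the covariance is chosen for the example rather than taken from the Telco data.

```python
import numpy as np

# Toy customers (invented numbers): high payers leave early, low payers stay
monthly_charge = np.array([90.0, 80.0, 70.0, 50.0, 40.0])
lifetime_months = np.array([4.0, 8.0, 20.0, 50.0, 70.0])

# Shortcut most tutorials use: product of the two means
naive_ltv = monthly_charge.mean() * lifetime_months.mean()

# What customers actually paid on average: mean of the per-customer products
mean_revenue = (monthly_charge * lifetime_months).mean()

# Population covariance (ddof=0) so that E[XY] = E[X]E[Y] + Cov(X, Y) holds exactly
cov = np.cov(monthly_charge, lifetime_months, ddof=0)[0, 1]

print(f"mean(ARPU) x mean(lifetime): {naive_ltv:,.2f}")
print(f"mean(ARPU x lifetime):       {mean_revenue:,.2f}")
print(f"covariance:                  {cov:,.2f}")
```

The gap between the two figures equals the covariance term, which is exactly what the product-of-means shortcut discards.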
The Skok formula at least uses a steady-state churn rate, but it assumes that rate is constant forever. The realised LTV is a real number, but it only describes the customers you have already lost.
None of the three tells you when a new customer breaks even on acquisition cost — for that you need a real retention curve, which is the next section, but first a word on CAC.
For B2C SaaS in the 2026 benchmarks, CAC ranges from about $68 (eCommerce) to over $200 (fintech), with mid-market subscription products clustering around $150 ([1], [2]); telecom subscriber acquisition cost is materially higher ($300+ once handset subsidies are amortised), so $150 is a conservative anchor for this dataset, and picking a higher number would only make the burn calculation in Section 4 larger.

Gross margin for B2C SaaS sits in the 70–85% band, with 75% as the usual midpoint that matches David Skok’s modelling assumptions for steady-state SaaS economics ([3]).
That gives us the building blocks for the cost of a single prediction error.
CAC = 150
ARPU = 64.80
REMAINING_TENURE_MONTHS = 18

FN_COST = CAC + ARPU * REMAINING_TENURE_MONTHS  # $1,316.40
FP_COST = 100                                   # typical campaign cost (midpoint)
ratio = FN_COST / FP_COST                       # 13.2 : 1

A false negative (predicting that a customer will stay when they actually leave) costs you the new acquisition spend ($150) plus 18 months of foregone revenue ($64.80 × 18 = $1,166.40), for $1,316.40 total, while a false positive (flagging a customer as a churn risk when they were going to stay) costs roughly $100 of campaign and discount expense, leaving a cost ratio of 13.2 to 1.
A note on what the framework is and is not. The $150 CAC and the $100 false-positive cost in this article are placeholders; CAC varies materially by acquisition channel, and the $100 is shorthand for whatever your real retention intervention costs — a discount, a CSM call, a bundle upgrade, a product investigation. None of these are interchangeable, and a blanket discount is not a retention strategy: it is a deferral mechanism that retains customers only until the discount expires while paying to retain customers who were never going to leave (and, worse, training them to expect the next discount). Real retention strategy maps churn drivers — a customer leaving because backup_online keeps failing is retained by fixing backup_online, not by a 10% off email — and allocates budget toward product improvement, with the campaign cost as a short-term bridge while engineering catches up. The profit curve here is a threshold-setting tool that operates after you have decided your retention playbook (what intervention applies to whom, at what real cost); it is not a substitute for that decision. Treat the $150 and the $100 as a single representative pair; segment them, and the framework segments with them.
That ratio is the entire reason threshold = 0.5 is the wrong default: the decision boundary should reflect the asymmetry, and we will get to the exact formula, but first comes the LTV piece.
3. The LTV profit curve
Most churn writeups treat lifetime value as a static dollar number you multiply by a hazard rate.
Survival analysis does better.
It measures retention directly from the data and turns LTV into a curve: the cumulative contribution margin per customer as a function of months since acquisition, starting at −CAC on day zero (you’ve paid to acquire the customer and earned nothing) and climbing as each surviving month adds ARPU × gross_margin × P(still alive) to the balance.
The Kaplan-Meier estimator does the heavy lifting, with tenure as the duration and Churn == "Yes" as the event, producing the overall curve in Figure 4.
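Before the library call, it can help to see what the estimator computes. The sketch below is a hand-rolled product-limit estimator on made-up toy data; `kaplan_meier` is a hypothetical helper written for this illustration, not part of lifelines, and the article's figures come from the lifelines fit, not from this function.

```python
import numpy as np


def kaplan_meier(durations, events):
    """Return (event_times, survival) from the product-limit estimator."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(durations[events])  # distinct times with an observed churn
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)             # customers still observed just before t
        churned = np.sum((durations == t) & events)  # churn events at exactly t
        s *= 1.0 - churned / at_risk                 # censored rows only shrink the risk set
        survival.append(s)
    return event_times, np.array(survival)


# Toy data (invented): tenure in months, 1 = churned, 0 = still active (censored)
toy_tenure = [1, 2, 2, 3, 5, 5, 6, 8]
toy_churned = [1, 1, 0, 1, 0, 1, 0, 0]
times, surv = kaplan_meier(toy_tenure, toy_churned)
print(dict(zip(times.tolist(), surv.round(3).tolist())))
```

Customers who are still active never trigger a drop in the curve; they simply leave the at-risk count once their observed tenure is exceeded, which is how the estimator uses censored rows without treating them as churn.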
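import numpy as np
from lifelines import KaplanMeierFitter

CAC, GROSS_MARGIN, HORIZON = 150, 0.75, 72  # ARPU = 64.80 is defined in Section 2

# Duration = tenure in months; event = 1 if the customer churned, 0 if still active (censored)
kmf = KaplanMeierFitter().fit(df["tenure"], (df["Churn"] == "Yes").astype(int))

months = np.arange(0, HORIZON + 1)
S = kmf.survival_function_at_times(months).values  # P(still a customer) at each month
monthly = ARPU * GROSS_MARGIN * S                  # survival-weighted margin per month
ltv_curve = np.cumsum(monthly) - CAC               # cumulative margin net of CAC
ltv_curve[0] = -CAC # day zero, only CAC is sunk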
Three key takeaways from the curve:
- Breakeven at month 3 (in expectation, not per customer): across the original acquired cohort, the survival-weighted cumulative contribution covers the $150 acquisition cost by month 3. CAC payback under twelve months is the David Skok rule of thumb, and Telco beats it by a factor of four. This is the right number for budgeting cohort-level retention spend, but it is a cohort average that hides bimodal variance: a customer who churns in month 1 individually contributes one month of margin ($48.60) and never recoups their CAC, while a 70-month survivor contributes roughly $3,400 (70 × $48.60). The Kaplan-Meier weighting bakes those early losses in correctly; they just do not get a star on the curve.
- LTV at the 72-month horizon ≈ $2,527 per customer: combined with the $150 CAC, that is an LTV:CAC ratio of about 17.8:1, well above the 3:1 floor most SaaS investors look for, and a useful sanity check that the dataset describes a healthy unit-economics business rather than a death-spiral one.
- Churn-reduction uplift is modest at the cohort level: a 10% reduction in churn increases terminal LTV by approximately 2.8%, and a 20% reduction raises it by about 5.7%. So while the improvement is real, it’s not dramatic, and the decision about which customers to acquire matters more than the retention effort itself.
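The breakeven month and the churn-reduction uplift can both be read off any survival curve with a few lines. The sketch below uses a constant 98% monthly retention as a stand-in; that rate is a hypothetical value, not fitted to Telco, so its outputs will not match the figures quoted above. `ltv_curve` and `breakeven_month` are illustrative helper names, and months are indexed from 0 as in the article's snippet.

```python
import numpy as np


def ltv_curve(survival, arpu=64.80, gross_margin=0.75, cac=150.0):
    """Cumulative survival-weighted margin minus CAC, indexed by month."""
    monthly_margin = arpu * gross_margin * np.asarray(survival, dtype=float)
    return np.cumsum(monthly_margin) - cac


def breakeven_month(curve):
    """First month index where the cumulative balance is non-negative, else None."""
    non_negative = np.flatnonzero(curve >= 0)
    return int(non_negative[0]) if non_negative.size else None


months = np.arange(0, 73)
monthly_retention = 0.98  # hypothetical constant retention, NOT fitted to Telco
curve = ltv_curve(monthly_retention ** months)
print("breakeven month:", breakeven_month(curve))

# Uplift: shrink the monthly churn rate by 10% and 20% and compare terminal LTV
terminal_ltv = {}
for reduction in (0.0, 0.10, 0.20):
    retention = 1 - (1 - monthly_retention) * (1 - reduction)
    terminal_ltv[reduction] = ltv_curve(retention ** months)[-1]
    print(f"churn reduced by {reduction:.0%}: terminal LTV ${terminal_ltv[reduction]:,.0f}")
```

Swapping the geometric curve for the Kaplan-Meier survival values would turn this into the cohort-level calculation described in the takeaways.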
Breaking down the same calculation by Contract (Figure 5) is where this framework truly proves its value.

A customer on a two-year contract is worth roughly $3,372 over the 72-month horizon, compared to $1,620 for a month-to-month customer — less than half — despite identical ARPU and CAC. The entire difference comes down to retention. From a marketing-spend standpoint, the customer on the right side of the chart (the one worth paying more to acquire) is the contract-locked subscriber, even though they may appear “less profitable per month” in any single-period snapshot.
This is exactly the kind of decision the standard IBM Telco analysis can’t support, because it never calculates survival-conditional LTV in the first place.
4. The classification profit curve
With FN cost, FP cost, and survival-based LTV all quantified, the threshold question simplifies to a one-dimensional optimization: train a model, generate predicted probabilities on the test set, sweep the decision threshold from 0 to 1, compute the total dollar cost at each threshold, and select the one that minimizes it.
The model used here is a tuned XGBoost trained with SMOTE on the training fold only — the standard Telco recipe.
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

y = (df["Churn"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]),
                   drop_first=True).astype(float)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Fit the scaler on the training fold only, then apply it to both folds
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Oversample the training fold only; the test fold keeps its real class balance
X_tr_b, y_tr_b = SMOTE(random_state=42).fit_resample(X_tr_s, y_tr)

model = XGBClassifier(n_estimators=400, max_depth=5,
                      learning_rate=0.05,
                      subsample=0.9, colsample_bytree=0.9,
                      random_state=42, eval_metric="logloss")
model.fit(X_tr_b, y_tr_b)
probs = model.predict_proba(X_te_s)[:, 1]

# Sweep the threshold and total the dollar cost of the errors at each value
thresholds = np.linspace(0.01, 0.99, 99)
totals = []
for t in thresholds:
    pred = (probs >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    totals.append(fn * 1316.40 + fp * 100)

The result appears in Figure 6.

The numbers, based on a 1,407-row test set:

Shifting from the 0.5 default to the empirical minimum recovers $121,160 on the test set, which works out to $86.11 per customer. Applied to a 100,000-subscriber book, that produces the headline figure of $8.6M. Your actual results will depend on your CAC, ARPU, and retention curve, but the key driver of the gap is the 13× cost asymmetry — it’s far more expensive to miss a churner than to over-treat a loyal customer.
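For readers without the dataset or XGBoost to hand, the sweep-and-compare step can be tried on synthetic scores. This is a minimal sketch: the labels and scores below are randomly generated stand-ins, `sweep_thresholds` is a hypothetical helper, and the dollar amounts it prints say nothing about Telco. Only the two cost constants come from the article.

```python
import numpy as np


def sweep_thresholds(y_true, scores, fn_cost, fp_cost, thresholds):
    """Total misclassification cost at each candidate threshold."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    costs = []
    for t in thresholds:
        flagged = scores >= t
        fn = np.sum(y_true & ~flagged)   # churners that were not flagged
        fp = np.sum(~y_true & flagged)   # loyal customers that were flagged
        costs.append(fn * fn_cost + fp * fp_cost)
    return np.array(costs, dtype=float)


# Synthetic stand-in data: about 27% churners, noisy scores that lean higher for churners
rng = np.random.default_rng(42)
y = rng.random(2000) < 0.27
scores = np.clip(0.25 + 0.25 * y + rng.normal(0, 0.2, 2000), 0, 1)

thresholds = np.linspace(0.01, 0.99, 99)
costs = sweep_thresholds(y, scores, fn_cost=1316.40, fp_cost=100.0, thresholds=thresholds)

best = thresholds[np.argmin(costs)]
default_cost = costs[np.argmin(np.abs(thresholds - 0.5))]  # cost at the 0.5 default
savings = default_cost - costs.min()
print(f"best threshold {best:.2f}, savings vs 0.5: ${savings:,.0f} "
      f"(${savings / len(y):,.2f} per customer)")
```

One design caveat: picking the threshold on the same rows used to report the savings is optimistic, so a separate validation fold for the sweep is the safer pattern when the data allows it.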
When the textbook formula loses to the threshold sweep
Pick up any cost-sensitive classification reference (Provost and Fawcett’s Data Science for Business is the canonical source [4]) and you’ll find the Bayes-optimal threshold formula:
t* = C_FP / (C_FP + C_FN)
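The formula follows from comparing the expected cost of the two possible actions for a single customer. The short derivation below assumes that p is a calibrated churn probability and that correct decisions cost nothing, which are the usual textbook assumptions:

```latex
\begin{aligned}
\mathbb{E}[\text{cost} \mid \text{flag}]    &= (1 - p)\, C_{FP}, \\
\mathbb{E}[\text{cost} \mid \text{no flag}] &= p\, C_{FN}, \\
\text{flag} \iff (1 - p)\, C_{FP} < p\, C_{FN}
  &\iff p > \frac{C_{FP}}{C_{FP} + C_{FN}} = t^{*}.
\end{aligned}
```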
Plugging in our cost ratio: t* = 100 / (100 + 1316.40) ≈ 0.0706. The math is correct, and on a model with well-calibrated probabilities, that threshold minimizes expected cost. But the sweep yields t = 0.03, and at that threshold the test-set cost is $8,661 lower than at 0.07. So where does the gap come from?
The Bayes-optimal formula assumes the model’s predicted probabilities are calibrated: a prediction of 0.5 should correspond to a 50% true churn probability. Our model, however, is trained on a SMOTE-balanced set in which the minority class is inflated to 50%, so its scores reflect the resampled class balance rather than the real one, and boosted trees can distort them further. A raw score of 0.07 is therefore not a 7% churn probability, and on this test set the cost-minimising cut on the raw scale falls at 0.03 rather than the 0.07 the formula prescribes for calibrated output. The textbook formula isn’t wrong; it’s being applied to an out-of-spec input.
There are two clean fixes:
- Calibrate the probabilities first: apply Platt scaling or isotonic regression on a held-out set, then apply the Bayes-optimal threshold to the calibrated output. scikit-learn’s CalibratedClassifierCV handles this in a single line.
- Skip calibration and just sweep: it’s computationally cheap, it tolerates calibration drift, and on a small dataset like Telco the test-set sweep is more reliable than a calibration model fit on a few hundred held-out rows.
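A minimal sketch of the first fix, under stated assumptions: the data is a synthetic stand-in from `make_classification`, the base model is scikit-learn's `GradientBoostingClassifier` rather than the article's XGBoost, and SMOTE is left out entirely. If resampling is used, the calibrator should be fitted on data with the original class balance, since calibrating on a resampled set would reproduce the wrong base rate.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Telco features (NOT the real dataset)
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.73, 0.27], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Isotonic calibration fitted with internal 5-fold cross-validation on the training fold
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)
probs = calibrated.predict_proba(X_te)[:, 1]

# Apply the Bayes-optimal threshold to the calibrated output
t_star = 100 / (100 + 1316.40)
flagged = probs >= t_star
print(f"t* = {t_star:.4f}, share flagged = {flagged.mean():.1%}")
```

Passing `method="sigmoid"` instead would give Platt scaling, which tends to be the safer choice when the calibration data is small.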
In practice, for production systems with regular retraining, the sweep is what most teams deploy. The formula is the right thing to teach (with calibration as the caveat), and both should appear in any honest writeup of a cost-sensitive churn model.
Neither one shows up in the IBM Telco corpus I indexed.
5. What the next IBM Telco article should report
Four concrete shifts would make the next 36 IBM Telco analyses more useful than the last 36:
Report a profit curve, not a confusion matrix. F1 at threshold 0.5 is a tournament metric: useful for ranking models when you have to pick one, useless for deciding where to set the threshold. The curve in Figure 6 contains more actionable insight for decision-making than every accuracy benchmark in the entire corpus put together.
Use survival analysis to estimate LTV, not steady-state assumptions. A Kaplan-Meier fit on tenure takes just 30 lines of Python. From it, you get the breakeven point, the LTV at a given horizon, and a contract-segmented retention curve — giving marketing teams a concrete retention budget and a defensible answer to “which customers should we invest more in acquiring?” The Skok formula still works well as a quick sanity check, but it shouldn’t be your primary LTV estimate.
Be transparent about calibration when reporting a Bayes-optimal threshold. Either calibrate your model first, or clearly state that the threshold shown is the empirical minimum found through a parameter sweep. Wang et al. ([5]) make a similar point using a more sophisticated metric (e-Profits) that applies survival analysis throughout, but the underlying principle is the same.
Segment the intervention, not just the prediction score. A profit curve assumes a single false-positive cost (the expense of acting on a false alarm). In reality, the most cost-effective action differs by customer type: a bundle upgrade for a high-value loyalist, a price-sensitivity review for an at-risk new account, and no action for a third group. Segment-specific FP costs and segment-specific thresholds are the natural next step beyond the framework presented here.
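One way to picture that next step is a per-segment threshold derived from per-segment costs. Everything in the sketch below is hypothetical: the `Segment` class, the segment names, and all dollar amounts are placeholders, and the threshold formula again assumes calibrated probabilities.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    fp_cost: float  # cost of intervening on a customer who would have stayed
    fn_cost: float  # cost of missing a churner in this segment

    @property
    def threshold(self) -> float:
        """Cost-optimal cut-off for calibrated churn probabilities."""
        return self.fp_cost / (self.fp_cost + self.fn_cost)


def should_intervene(segment: Segment, churn_probability: float) -> bool:
    """Act only when the calibrated probability clears the segment's own threshold."""
    return churn_probability >= segment.threshold


# Hypothetical placeholder costs, one pair per intervention type
segments = [
    Segment("high-value loyalist (bundle upgrade)", fp_cost=200.0, fn_cost=3000.0),
    Segment("at-risk new account (pricing review)", fp_cost=50.0, fn_cost=800.0),
]

for seg in segments:
    print(f"{seg.name}: threshold {seg.threshold:.4f}")
```

A "no action" group needs no threshold at all; it can simply be excluded before scoring.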
I started this project expecting the gap to be in the modeling itself. It wasn’t.
The IBM Telco dataset has been exhaustively studied for predictive accuracy. What it can still teach us is whether our pipelines produce good decisions — not just accurate predictions.
That requires three things: dollar-denominated error costs, real customer retention curves, and an honest classifier threshold. Four scripts and a Kaplan-Meier fit are all you need to get there.
References
[1] Genesys Growth Marketing, Customer Acquisition Cost Benchmarks: 44 Statistics Every Marketing Leader Should Know in 2026 (2026), genesysgrowth.com.
[2] Proven SaaS, CAC Payback Benchmarks 2026: SaaS Customer Acquisition Cost (2026), proven-saas.com.
[3] D. Skok, SaaS Metrics 2.0: Detailed Definitions (2014, updated 2024), forentrepreneurs.com.
[4] F. Provost and T. Fawcett, Data Science for Business (2013), O’Reilly Media, ch. 7–8.
[5] Y. Wang, S. Albrecht, et al., e-Profits: A Business-Aligned Evaluation Metric for Profit-Sensitive Customer Churn Prediction (2025), arXiv:2507.08860.
[6] C. Davidson-Pilon, lifelines: survival analysis in Python (2019), Journal of Open Source Software 4(40), 1317.
[7] S. Höppner, E. Stripling, B. Baesens, S. vanden Broucke and T. Verdonck, Profit driven decision trees for churn prediction (2020), European Journal of Operational Research, 284(3).
[8] N. El Attar and M. El-Hajj, A systematic review of customer churn prediction approaches in telecommunications (2026), Frontiers in Artificial Intelligence.
Code, data, and reproducible scripts for every figure are available on request. The dataset is the IBM Telco Customer Churn dataset, fully synthetic sample data published by IBM in its official repository (github.com/IBM/telco-customer-churn-on-icp4d) under the Apache License 2.0, which permits use, derivative analysis, and publication with attribution. The data is synthetic and contains no real customers or PII.