# Introduction
When applying for a job at Meta (formerly Facebook), Apple, Amazon, Netflix, or Alphabet (Google), collectively known as FAANG, interviews rarely test whether you can recite textbook definitions. Instead, interviewers want to see whether you analyze data critically and whether you would identify a bad analysis before it ships to production. Statistical traps are one of the most reliable ways to test that.
These pitfalls reflect the kinds of decisions that analysts face daily: a dashboard number that looks fine but is actually misleading, or an experiment result that looks actionable but contains a structural flaw. The interviewer already knows the answer. What they are watching is your thought process, including whether you ask the right questions, notice missing information, and push back on a number that looks good at first sight. Candidates stumble over these traps repeatedly, even those with strong mathematical backgrounds.
We’ll examine five of the most common traps.
# Understanding Simpson’s Paradox
This trap aims to catch people who unquestioningly trust aggregated numbers.
Simpson’s paradox happens when a trend appears in several groups of data but vanishes or reverses when those groups are combined. The classic example is UC Berkeley’s 1973 admissions data: overall admission rates favored men, but when broken down by department, women had equal or better admission rates. The aggregate number was misleading because women applied to more competitive departments.
The paradox can arise whenever groups differ in both size and base rate. Understanding that is what separates a surface-level answer from a deep one.
In interviews, a question might look like this: “We ran an A/B test. Overall, variant B had a higher conversion rate. However, when we break it down by device type, variant A performed better on both mobile and desktop. What is happening?” A strong candidate refers to Simpson’s paradox, explains its cause (group proportions differ between the two variants), and asks to see the breakdown rather than trust the aggregate figure.
Interviewers use this to check whether you instinctively ask about subgroup distributions. If you just report the overall number, you have lost points.
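One way to see why the reversal can happen is to write each variant's overall rate as a traffic-weighted average of its per-device rates. The sketch below is only an illustration: the `blended_rate` helper and every rate and traffic share in it are made-up assumptions, not figures from a real experiment.

```python
def blended_rate(segment_rates, traffic_shares):
    """Overall rate as a traffic-weighted average of per-segment rates."""
    if abs(sum(traffic_shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(segment_rates[s] * traffic_shares[s] for s in segment_rates)


# Hypothetical per-device conversion rates: A is better on both devices.
rates_a = {"mobile": 0.90, "desktop": 0.10}
rates_b = {"mobile": 0.85, "desktop": 0.05}

# Hypothetical traffic mix: B's traffic is mostly high-converting mobile.
shares_a = {"mobile": 0.10, "desktop": 0.90}
shares_b = {"mobile": 0.90, "desktop": 0.10}

overall_a = blended_rate(rates_a, shares_a)
overall_b = blended_rate(rates_b, shares_b)

# Re-weight both variants to the same 50/50 mix to remove the mix effect.
even_mix = {"mobile": 0.5, "desktop": 0.5}
balanced_a = blended_rate(rates_a, even_mix)
balanced_b = blended_rate(rates_b, even_mix)

print(f"Observed mix:  A = {overall_a:.2%}, B = {overall_b:.2%}")
print(f"Identical mix: A = {balanced_a:.2%}, B = {balanced_b:.2%}")
```

With these assumed numbers, B looks far better under the observed traffic mix, yet A comes out ahead once both variants are weighted identically. The gap in the aggregate comes from where the traffic went, not from the variant itself.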
// Demonstrating With A/B Test Data
In the following demonstration using Pandas, we can see how the aggregate rate can be misleading.
```python
import pandas as pd

# A wins on both devices individually, but B wins in aggregate
# because B gets most traffic from higher-converting mobile.
data = pd.DataFrame({
    'device': ['mobile', 'mobile', 'desktop', 'desktop'],
    'variant': ['A', 'B', 'A', 'B'],
    'converts': [90, 765, 90, 5],
    'visitors': [100, 900, 900, 100],
})
data['rate'] = data['converts'] / data['visitors']

print('Per device:')
print(data[['device', 'variant', 'rate']].to_string(index=False))

print('\nAggregate (misleading):')
agg = data.groupby('variant')[['converts', 'visitors']].sum()
agg['rate'] = agg['converts'] / agg['visitors']
print(agg['rate'])
```
# Identifying Selection Bias
This test lets interviewers assess whether you think about where data comes from before analyzing it.
Selection bias arises when the data you have is not representative of the population you are trying to understand. Because the bias is in the data collection process rather than in the analysis, it is easy to overlook.
Consider these possible interview framings:
- “We analyzed a survey of our users and found that 80% are satisfied with the product. Does that tell us our product is good?” A solid candidate would point out that satisfied users are more likely to respond to surveys. The 80% figure probably overstates satisfaction, since unhappy users likely chose not to participate.
- “We examined customers who left last quarter and discovered they mostly had poor engagement scores. Should we focus on engagement to reduce churn?” The problem here is that you only have engagement data for churned users. Without engagement data for users who stayed, it is impossible to know whether low engagement actually predicts churn or is simply common across all users.
A related variant worth knowing is survivorship bias: you only observe the outcomes that made it through some filter. If you only use data from successful products to analyze why they succeeded, you are ignoring the products that failed despite sharing the very traits you are treating as strengths.
// Simulating Survey Non-Response
We can simulate how non-response bias skews results using NumPy.
```python
import numpy as np

np.random.seed(42)

# Simulate users where satisfied users are more likely to respond
satisfaction = np.random.choice([0, 1], size=1000, p=[0.5, 0.5])

# Response probability: 80% for satisfied, 20% for unsatisfied
response_prob = np.where(satisfaction == 1, 0.8, 0.2)
responded = np.random.rand(1000) < response_prob

print(f"True satisfaction rate: {satisfaction.mean():.2%}")
print(f"Survey satisfaction rate: {satisfaction[responded].mean():.2%}")
```
Interviewers use selection bias questions to see if you separate “what the data shows” from “what is true about users.”
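The churn framing from the list above can be sketched in the same spirit. In this constructed example, engagement scores are low for almost everyone and churn is generated independently of engagement, so looking at churned users alone would suggest a pattern that a comparison with retained users does not support. The distribution, the 25% churn rate, and the 0.3 "low engagement" cutoff are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n_users = 5000

# Assumption: engagement scores are low for nearly everyone (mean about 0.2),
# and churn is generated independently of engagement.
engagement = rng.beta(2, 8, size=n_users)
churned = rng.random(n_users) < 0.25

# Hypothetical cutoff for what counts as "poor engagement"
low_engagement = engagement < 0.3

share_low_churned = low_engagement[churned].mean()
share_low_retained = low_engagement[~churned].mean()

print(f"Churned users with low engagement:  {share_low_churned:.1%}")
print(f"Retained users with low engagement: {share_low_retained:.1%}")
```

Both groups show a similarly high share of low-engagement users, which is the comparison the churned-only analysis never made.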
# Preventing p-Hacking
p-hacking (also called data dredging) happens when you run many tests and only report the ones with \( p < 0.05 \).
The problem is that a \( p \)-value is calibrated for a single test. If 20 tests are run at a 5% significance level and none has a real effect, one false positive is expected by chance alone. Fishing for a significant result therefore inflates the share of reported findings that are false.
An interviewer might ask you the following: “Last quarter, we conducted fifteen feature experiments. At \( p < 0.05 \), three were found to be significant. Do all three need to be shipped?” A weak answer says yes.
A strong answer would first ask what the hypotheses were before the tests were run, whether the significance threshold was set in advance, and whether the team corrected for multiple comparisons.
The follow-up often involves how you would design experiments to avoid this. Pre-registering hypotheses before data collection is the most direct fix, since it removes the option to decide after the fact which tests were “real.”
// Watching False Positives Accumulate
We can observe how false positives occur by chance using SciPy.
```python
import numpy as np
from scipy import stats

np.random.seed(0)

# 20 A/B tests where the null hypothesis is TRUE (no real effect)
n_tests, alpha = 20, 0.05
false_positives = 0
for _ in range(n_tests):
    a = np.random.normal(0, 1, 1000)
    b = np.random.normal(0, 1, 1000)  # identical distribution!
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f'Tests run: {n_tests}')
print(f'False positives (p<0.05): {false_positives}')
print(f'Expected by chance alone: {n_tests * alpha:.0f}')
```
Even with zero real effect, roughly 1 in 20 tests clears \( p < 0.05 \) by chance. If a team runs 15 experiments and reports only the significant ones, some of those results may well be noise.
It is equally important to treat exploratory analysis as a form of hypothesis generation rather than confirmation. Before anyone takes action based on an exploratory result, a confirmatory experiment is needed.
# Managing Multiple Testing
This test is closely related to p-hacking, but it is worth understanding on its own.
The multiple testing problem is the formal statistical issue: when you run many hypothesis tests simultaneously, the probability of at least one false positive grows quickly. Even if the treatment has no effect, you should expect roughly 5 false positives if you test 100 metrics in an A/B test and declare anything with \( p < 0.05 \) significant.
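How quickly that probability grows can be sketched with a few lines of arithmetic. The helper below (the name `familywise_error_rate` is just illustrative) assumes the tests are independent and that no metric has a real effect, which is a simplification since real experiment metrics are often correlated.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across independent tests with no real effects."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.05
for m in (1, 5, 20, 50, 100):
    fwer = familywise_error_rate(m, alpha)
    expected = m * alpha  # expected number of false positives
    print(f"{m:>3} tests: P(at least one false positive) = {fwer:.1%}, "
          f"expected false positives = {expected:.2f}")
```

Under these assumptions, the chance of at least one spurious "win" is already well above one half by the time 20 metrics are tested.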
The corrections for this are well known: the Bonferroni correction (divide alpha by the number of tests) and Benjamini-Hochberg (which controls the false discovery rate rather than the family-wise error rate).
Bonferroni is a conservative approach: for example, if you test 50 metrics, your per-test threshold drops to 0.001, making it harder to detect real effects. Benjamini-Hochberg is more appropriate if you are willing to accept some false discoveries in exchange for more statistical power.
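To make the difference concrete, here is a minimal NumPy sketch of both corrections. The function names and the list of p-values are hypothetical, chosen only so the two procedures disagree. Libraries such as statsmodels provide tested implementations that would normally be preferable in practice.

```python
import numpy as np


def bonferroni_reject(p_values, alpha=0.05):
    """Reject only tests whose p-value is at most alpha / number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value is compared against alpha * i / m
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        last = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
        reject[order[: last + 1]] = True    # reject everything up to that rank
    return reject


# Hypothetical p-values from ten metrics in one experiment
p_values = [0.001, 0.008, 0.012, 0.019, 0.041, 0.20, 0.35, 0.50, 0.74, 0.90]

bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)

print(f"Uncorrected (p < 0.05): {sum(p < 0.05 for p in p_values)} significant")
print(f"Bonferroni:             {bonf.sum()} significant")
print(f"Benjamini-Hochberg:     {bh.sum()} significant")
```

On this made-up input, Bonferroni keeps a single result while Benjamini-Hochberg keeps four, which illustrates the power-versus-strictness trade-off described above.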
In interviews, this comes up when discussing how a company tracks experiment metrics. A question might be: “We monitor 50 metrics per experiment. How do you decide which ones matter?” A solid response discusses pre-specifying primary metrics before the experiment runs and treating secondary metrics as exploratory, while acknowledging the multiple testing issue.
Interviewers are trying to find out whether you are aware that running more tests produces more noise rather than more information.
# Addressing Confounding Variables
This trap catches candidates who treat correlation as causation without asking what else might explain the relationship.
A confounding variable is one that influences both the independent and dependent variables, creating the illusion of a direct relationship where none exists.
The classic example: ice cream sales and drowning rates are correlated, but the confounder is summer heat; both go up in warm months. Acting on that correlation without accounting for the confounder leads to bad decisions.
Confounding is particularly dangerous in observational data. Unlike a randomized experiment, observational data does not distribute potential confounders evenly between groups, so differences you see might not be caused by the variable you are studying at all.
A common interview framing is: “We noticed that users who use our mobile app more tend to have significantly higher revenue. Should we push notifications to increase app opens?” A weak candidate says yes. A strong one asks what kind of user opens the app frequently to begin with: likely the most engaged, highest-value users.
Engagement drives both app opens and spending. The app opens are not causing revenue; they are a symptom of the same underlying user quality.
Interviewers use confounding to test whether you distinguish correlation from causation before drawing conclusions, and whether you will push for randomized experimentation or propensity score matching before recommending action.
// Simulating A Confounded Relationship
```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000

# Confounder: user quality (0 = low, 1 = high)
user_quality = np.random.binomial(1, 0.5, n)

# App opens driven by user quality, not independent
app_opens = user_quality * 5 + np.random.normal(0, 1, n)

# Revenue also driven by user quality, not app opens
revenue = user_quality * 100 + np.random.normal(0, 10, n)

df = pd.DataFrame({
    'user_quality': user_quality,
    'app_opens': app_opens,
    'revenue': revenue
})

# Naive correlation looks strong, which is misleading
naive_corr = df['app_opens'].corr(df['revenue'])

# Within-group correlation (controlling for confounder) is near zero
low = df[df['user_quality'] == 0]
high = df[df['user_quality'] == 1]
corr_low = low['app_opens'].corr(low['revenue'])
corr_high = high['app_opens'].corr(high['revenue'])

print(f"Naive correlation (app opens vs revenue): {naive_corr:.2f}")
print("Correlation controlling for user quality:")
print(f"  Low-quality users: {corr_low:.2f}")
print(f"  High-quality users: {corr_high:.2f}")
```
Output:
```
Naive correlation (app opens vs revenue): 0.91
Correlation controlling for user quality:
  Low-quality users: 0.03
  High-quality users: -0.07
```
The naive number looks like a strong signal. Once you control for the confounder, it disappears entirely. Interviewers who see a candidate run this kind of stratified check (rather than accepting the aggregate correlation) know they are talking to someone who will not ship a broken recommendation.
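Randomization is the other remedy mentioned above, and a small simulation can hint at why it works. In this sketch, a coin flip rather than user quality decides who gets nudged into extra app opens, and revenue is generated without any dependence on app opens. All of the variable names, effect sizes, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Confounder: user quality (0 = low, 1 = high)
user_quality = rng.binomial(1, 0.5, n)

# Revenue depends on user quality only (no causal effect of app opens, by construction)
revenue = user_quality * 100 + rng.normal(0, 10, n)

# Observational setting: user quality also drives app opens
observed_opens = user_quality * 5 + rng.normal(0, 1, n)

# Randomized setting: a coin flip, not user quality, drives app opens
treated = rng.binomial(1, 0.5, n).astype(bool)
randomized_opens = np.where(treated, 5, 0) + rng.normal(0, 1, n)

obs_corr = np.corrcoef(observed_opens, revenue)[0, 1]
rand_corr = np.corrcoef(randomized_opens, revenue)[0, 1]

print(f"Observational correlation (opens vs revenue): {obs_corr:.2f}")
print(f"Randomized correlation (opens vs revenue):    {rand_corr:.2f}")
```

Because assignment is independent of user quality, the confounder is balanced across the two groups on average, and the spurious correlation should largely vanish in the randomized setting.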
# Wrapping Up
All five of these traps have something in common: they require you to slow down and question the data before accepting what the numbers seem to show at first glance. Interviewers use these scenarios specifically because your first instinct is often wrong, and the depth of your answer after that first instinct is what separates a candidate who can work independently from one who needs direction on every analysis.

None of these ideas are obscure, and interviewers ask about them because they are typical failure modes in real data work. The candidate who recognizes Simpson’s paradox in a product metric, catches a selection bias in a survey, or questions whether an experiment result survived multiple comparisons is the one who will ship fewer bad decisions.
If you go into FAANG interviews with a reflex to ask the following questions, you are already ahead of most candidates:
- How was this data collected?
- Are there subgroups that tell a different story?
- How many tests contributed to this result?
Beyond helping in interviews, these habits can also prevent bad decisions from reaching production.
Nate Rosidi is a data scientist who also works in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.



