who has written a children's book and launched it in two versions at the same time, at the same price. One version has a basic cover design, while the other has a high-quality cover design, which of course cost him more.
He then observes the sales for a certain period and gathers the data shown below.
Now he comes to us and wants to know whether the cover design of his books has affected their sales.
From the sales data, we can see that there are two categorical variables. The first is cover type, which is either high cost or low cost, and the second is sales outcome, which is either sold or not sold.
Now we want to know whether these two categorical variables are related or not.
We know that when we need to find a relationship between two categorical variables, we use the Chi-square test for independence.
In this scenario, we'll use Python to apply the Chi-square test and calculate the chi-square statistic and p-value.
Code:
import numpy as np
from scipy.stats import chi2_contingency

# Observed data
observed = np.array([
    [320, 180],
    [350, 150]
])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print("Chi-square statistic:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Expected frequencies:\n", expected)

Result:
Chi-square statistic: ≈ 4.0706
p-value: ≈ 0.0436
Degrees of freedom: 1
Expected frequencies: [[335. 165.] [335. 165.]]
The chi-square statistic is 4.07 with a p-value of about 0.0436, which is below the 0.05 threshold. This suggests that cover type and sales are statistically related.
We now have the p-value, but before treating it as a decision, we need to understand how we got this value and what the assumptions of this test are.
Understanding this will help us decide whether the result we obtained is reliable or not.
Now let's try to understand what the Chi-Square test actually is.
We have this data.

                  Sold   Not Sold   Total
Low-cost cover     320      180      500
High-cost cover    350      150      500
Total              670      330     1000

By observing the data, we can see that sales for books with the high-cost cover are higher, so we might think that the cover worked.
However, in real life, the numbers fluctuate by chance. Even if the cover has no effect and customers pick books randomly, we can still get unequal values.
Randomness always creates imbalances.
Now the question is, "Is this difference bigger than what randomness usually creates?"
Let's see how the Chi-Square test answers that question.
We already have this formula to calculate the Chi-Square statistic:
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c}
\frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]
where:
χ² is the Chi-Square test statistic
i represents the row index
j represents the column index
Oᵢⱼ is the observed count in row i and column j
Eᵢⱼ is the expected count in row i and column j
First, let's focus on expected counts.
Before understanding what expected counts are, let's state the hypotheses for our test.
Null Hypothesis (H₀)
The cover type and sales outcome are independent. (The cover type has no effect.)
Alternative Hypothesis (H₁)
The cover type and sales outcome are not independent. (The cover type is associated with whether a book is sold.)
Now what do we mean by expected counts?
Let's say the null hypothesis is true, which means the cover type has no effect on the sales of books.
Let's go back to probabilities.
As we already know, the formula for simple probability is:
\[P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}\]
In our data, the overall probability of a book being sold is:
\[P(\text{Sold}) = \frac{\text{Number of books sold}}{\text{Total number of books}} = \frac{670}{1000} = 0.67\]
In probability, when we write P(A | B), we mean the probability of event A given that event B has already occurred.
Under independence, cover type and sales are not related. This means the probability of being sold does not depend on cover type, which means:
\[
P(\text{Sold} \mid \text{Low-cost cover}) = P(\text{Sold})
\]
\[
P(\text{Sold} \mid \text{High-cost cover}) = P(\text{Sold})
\]
\[
P(\text{Sold}) = \frac{670}{1000} = 0.67
\]
Therefore,
\[
P(\text{Sold} \mid \text{Low-cost cover}) = 0.67
\]
Under independence, we have P(Sold | Low-cost cover) = 0.67, which means 67% of low-cost cover books are expected to be sold.
Since we have 500 books with low-cost covers, we convert this probability into an expected number of sold books.
\[0.67 \times 500 = 335\]
This means we expect 335 low-cost cover books to be sold under independence.
Based on our data table, we can represent this as E₁₁.
Similarly, the expected value for high-cost cover and sold is also 335, which is represented by E₂₁.
Now let's calculate E₁₂ (low-cost cover, not sold) and E₂₂ (high-cost cover, not sold).
The overall probability of a book not being sold is:
\[P(\text{Not Sold}) = \frac{330}{1000} = 0.33\]
Under independence, this probability applies to each subgroup, as before.
\[P(\text{Not Sold} \mid \text{Low-cost cover}) = 0.33\]
\[P(\text{Not Sold} \mid \text{High-cost cover}) = 0.33\]
Now we convert this probability into the expected count of unsold books.
\[E_{12} = 0.33 \times 500 = 165\]
\[E_{22} = 0.33 \times 500 = 165\]
We used probabilities here to understand the idea of expected counts, but there is also a direct formula to calculate them. Let's take a look at it.
Formula to calculate expected counts:
\[E_{ij} = \frac{R_i \times C_j}{N}\]
Where:
- Rᵢ = row total
- Cⱼ = column total
- N = grand total
Low-cost cover, Sold:
\[E_{11} = \frac{500 \times 670}{1000} = 335\]
Low-cost cover, Not Sold:
\[E_{12} = \frac{500 \times 330}{1000} = 165\]
High-cost cover, Sold:
\[E_{21} = \frac{500 \times 670}{1000} = 335\]
High-cost cover, Not Sold:
\[E_{22} = \frac{500 \times 330}{1000} = 165\]
Both ways give the same values.
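The two routes to the expected counts can be sketched in a few lines of NumPy. This is a minimal sketch using the table from this article; `np.outer` builds the full expected-count table in one step.

```python
import numpy as np

# Observed 2x2 table: rows = cover type (low-cost, high-cost),
# columns = outcome (sold, not sold)
observed = np.array([[320, 180],
                     [350, 150]])

row_totals = observed.sum(axis=1)   # [500, 500]
col_totals = observed.sum(axis=0)   # [670, 330]
n = observed.sum()                  # 1000

# Probability route: P(Sold) and P(Not Sold) applied to each row total
p_outcome = col_totals / n          # [0.67, 0.33]
expected_from_probs = np.outer(row_totals, p_outcome)

# Formula route: E_ij = (row total x column total) / grand total
expected_from_formula = np.outer(row_totals, col_totals) / n

print(expected_from_formula)  # [[335. 165.] [335. 165.]]
```

Both arrays come out identical, which is exactly the point: the R×C/N formula is just the probability argument written compactly.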
By calculating expected counts, what we are finding is this: we assume the null hypothesis is true, so the two categorical variables are independent.
Here, we have 1,000 books and we know that 670 were sold. Now we imagine randomly selecting books and labeling them as sold.
After selecting 670 books, we check how many of them belong to the low-cost cover group and how many belong to the high-cost cover group.
If we repeat this process many times, we would obtain values around 335. Sometimes they might be 330 or 340.
If we then take the average, 335 becomes the central point of the distribution when everything happens purely due to randomness.
This doesn't mean the count must equal 335, but 335 represents the natural center of variation under independence.
The Chi-Square test then measures how far the observed count deviates from this central value, relative to the variation expected under randomness.
We calculated the expected counts:
E₁₁ = 335; E₂₁ = 335; E₁₂ = 165; E₂₂ = 165

The next step is to calculate the deviation between the observed and expected counts. To do this, we subtract the expected count from the observed count.
\begin{aligned}
\text{Low-cost cover, Sold:} \quad & O - E = 320 - 335 = -15 \\[8pt]
\text{Low-cost cover, Not Sold:} \quad & O - E = 180 - 165 = 15 \\[8pt]
\text{High-cost cover, Sold:} \quad & O - E = 350 - 335 = 15 \\[8pt]
\text{High-cost cover, Not Sold:} \quad & O - E = 150 - 165 = -15
\end{aligned}
In the next step, we square the differences, because if we add the raw deviations, the positive and negative values cancel out, resulting in zero.
This would incorrectly suggest that there is no imbalance. Squaring solves the cancellation problem by letting us measure the magnitude of the imbalance, regardless of direction.
\begin{aligned}
\text{Low-cost cover, Sold:} \quad & (O - E)^2 = (-15)^2 = 225 \\[6pt]
\text{Low-cost cover, Not Sold:} \quad & (15)^2 = 225 \\[6pt]
\text{High-cost cover, Sold:} \quad & (15)^2 = 225 \\[6pt]
\text{High-cost cover, Not Sold:} \quad & (-15)^2 = 225
\end{aligned}
Now that we have the squared deviations for each cell, the next step is to divide them by their respective expected counts.
This standardizes the deviations by scaling them relative to what was expected under the null hypothesis.
\begin{aligned}
\text{Low-cost cover, Sold:} \quad & \frac{(O - E)^2}{E} = \frac{225}{335} = 0.6716 \\[6pt]
\text{Low-cost cover, Not Sold:} \quad & \frac{225}{165} = 1.3636 \\[6pt]
\text{High-cost cover, Sold:} \quad & \frac{225}{335} = 0.6716 \\[6pt]
\text{High-cost cover, Not Sold:} \quad & \frac{225}{165} = 1.3636
\end{aligned}
Now, for every cell, we have calculated:
\[
\frac{(O - E)^2}{E}
\]
Each of these values represents the standardized squared contribution of a cell to the total imbalance. Summing them gives the overall standardized squared deviation for the table, called the Chi-Square statistic.
\begin{aligned}
\chi^2 &= 0.6716 + 1.3636 + 0.6716 + 1.3636 \\[6pt]
&= 4.0704 \\[6pt]
&\approx 4.07
\end{aligned}
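The whole three-step computation (deviations, squares, standardized sum) can be written as a short vectorized sketch over the table used in this article:

```python
import numpy as np

observed = np.array([[320, 180],
                     [350, 150]])
expected = np.array([[335.0, 165.0],
                     [335.0, 165.0]])

deviations = observed - expected            # [[-15, 15], [15, -15]]
contributions = deviations ** 2 / expected  # per-cell (O - E)^2 / E
chi2_stat = contributions.sum()

print(round(chi2_stat, 2))  # 4.07
```

The tiny difference from the hand-computed 4.0704 is only rounding: the code keeps the per-cell contributions at full precision before summing.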
We obtained a Chi-Square statistic of 4.07.
How do we interpret this value?
After calculating the chi-square statistic, we compare it with the critical value from the chi-square distribution table for 1 degree of freedom at a significance level of 0.05.
For df = 1 and α = 0.05, the critical value is 3.84. Since our calculated value (4.07) is greater than 3.84, we reject the null hypothesis.
The chi-square test is complete at this point, but we still need to understand what df = 1 means and how the critical value of 3.84 is obtained.
This is where things start to get both interesting and slightly confusing.
First, let's understand what df = 1 means.
'df' stands for degrees of freedom.
From our data,

We can call this a contingency table, and to be specific, it is a 2×2 contingency table, because it is defined by the number of categories in variable 1 as rows and the number of categories in variable 2 as columns. Here we have 2 rows and 2 columns.
We can see that the row totals and column totals are fixed. This means that if one cell value changes, the other three must adjust accordingly to preserve these totals.
In other words, there is only one independent way the table can vary while keeping the row and column totals fixed. Therefore, the table has 1 degree of freedom.
We can also compute the degrees of freedom using the standard formula for a contingency table:
\[
df = (r - 1)(c - 1)
\]
where r is the number of rows and c is the number of columns.
In our example, we have a 2×2 table, so:
\[
df = (2 - 1)(2 - 1)
\]
\[
df = 1
\]
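In code this formula is one line; here it is for our table's shape:

```python
# Degrees of freedom from the table's shape: (rows - 1) * (columns - 1)
r, c = 2, 2  # our 2x2 contingency table
df = (r - 1) * (c - 1)
print(df)  # 1
```

This matches the `dof` value that `chi2_contingency` returned earlier.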
We now have an idea of what degrees of freedom mean from the data table. But why do we need to calculate them?
Now, let's imagine a four-dimensional space in which each axis corresponds to one cell of the contingency table:
Axis 1: Low-cost & Sold
Axis 2: Low-cost & Not Sold
Axis 3: High-cost & Sold
Axis 4: High-cost & Not Sold
From the data table, we have the observed counts (320, 180, 350, 150). We also calculated the expected counts under independence as (335, 165, 335, 165).
Both the observed and expected counts can be represented as points in a four-dimensional space.
So now we have two points in a four-dimensional space.
We already calculated the difference between the observed and expected counts: (−15, 15, 15, −15).
We can write it as −15 × (1, −1, −1, 1).
In the observed data,

let's say we increase the Low-cost & Sold count from 320 to 321 (a +1 change).
To keep the row and column totals fixed, Low-cost & Not Sold must decrease by 1, High-cost & Sold must decrease by 1, and High-cost & Not Sold must increase by 1.
This produces the pattern (1, −1, −1, 1).
Any valid change in a 2×2 table with fixed margins follows this same pattern multiplied by some scalar.
Under fixed row and column totals, many different 2×2 tables are possible. When we represent each table as a point in four-dimensional space, these tables lie on a one-dimensional straight line.
We can refer to the expected counts, (335, 165, 335, 165), as the center of that straight line, and let's denote that point as E.
The point E lies at the center of the line because, under pure randomness (independence), these are the values we expect to observe.
We then measure how much the observed counts deviate from these expected counts.
We can see that every point on the line is:
E + x (1, −1, −1, 1)
where x is any scalar.
From our observed data table, we can write:
O = E + (−15) (1, −1, −1, 1)
Similarly, every point can be written like this.
The vector (1, −1, −1, 1) defines the direction of the one-dimensional deviation space. We call it a direction vector. The scalar value just tells us how far to move in that direction.
Every valid table is obtained by starting at the expected table and moving some distance along this direction.
For example, any point on the line is (335 + x, 165 − x, 335 − x, 165 + x).
Substituting x = −15, the values become
(335 − 15, 165 + 15, 335 + 15, 165 − 15),
which simplifies to (320, 180, 350, 150).
This matches our observed table.
We can imagine that as x changes, the table moves in only one direction, along a straight line.
This means the entire deviation from independence is controlled by a single scalar value, which moves the table along a straight line.
Since all tables lie along a one-dimensional line, the system has only one independent direction of movement. That is why the degrees of freedom equal 1.
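The one-parameter family of tables described above can be sketched directly. The helper `table(x)` below is a hypothetical name introduced here for illustration; it walks along the direction vector from the expected point E.

```python
import numpy as np

expected = np.array([335, 165, 335, 165], dtype=float)
direction = np.array([1, -1, -1, 1], dtype=float)

def table(x):
    # Every table with the same margins is E + x * (1, -1, -1, 1)
    return expected + x * direction

# x = -15 recovers the observed table
print(table(-15))  # [320. 180. 350. 150.]

# Margins stay fixed for any x: reshape to 2x2 and check the totals
t = table(7.3).reshape(2, 2)
print(t.sum(axis=1), t.sum(axis=0))  # [500. 500.] [670. 330.]
```

Whatever scalar x we pick, the row totals stay (500, 500) and the column totals stay (670, 330), which is the geometric picture of a single degree of freedom.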
At this point, we know how to compute the chi-square statistic. As derived earlier, standardizing the deviation from the expected count and squaring it results in a chi-square value of 4.07.
Now that we understand what degrees of freedom mean, let's explore what the chi-square distribution actually is.
Coming back to our observed data, we have 1,000 books in total. Of these, 670 were sold and 330 were not sold.
Under the assumption of independence (i.e., cover type does not influence whether a book is sold), we can imagine randomly selecting 670 books out of the 1,000 and labeling them as "sold."
We then count how many of these selected books have a low-cost cover. Let this count be denoted by X.
If we repeat this experiment many times, as discussed earlier, each repetition would produce a different value of X, such as 321, 322, 326, and so on.
Now, if we plot these values across many repetitions, we can see that they cluster around 335, forming a bell-shaped curve.
Plot:

We can see that it looks like the normal distribution.
From our observed data table, the number of low-cost and sold books is 320. The distribution shown above represents how values behave under independence.
We see that values like 334 and 336 are common, while 330 and 340 are somewhat less common. A value like 320 appears to be relatively rare.
But how do we judge this properly? To answer that, we must compare 320 to the center of the distribution, which is 335, and consider how wide the curve is.
The width of the curve reflects how much natural variation we expect under independence. Based on this spread, we can assess how frequently a value like 320 would occur.
For that, we need to perform standardization.
Expected value: \( \mu = 335 \)
Observed value: \( X = 320 \)
Difference: \( 320 - 335 = -15 \)
Standard deviation: \( \sigma \approx 7.4347 \)
\[
Z = \frac{320 - 335}{7.4347} \approx -2.0176
\]
So, 320 is about two standard deviations below the average.
We have now calculated the Z-score.
The Z-score of 320 is approximately −2.0176.
In the same way, if we standardize every possible value of X, the sampling distribution of X above gets transformed into the standard normal distribution with mean 0 and standard deviation 1.

Now, we already know that 320 is about two standard deviations below the average.
Z-score = −2.0176
We already computed a chi-square statistic equal to 4.07.
Now let's square the Z-score:
Z² = (−2.0176)² ≈ 4.0706, and this is equal to our chi-square statistic.
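Here is a minimal sketch of this standardization. The variance formula used for σ is an assumption on my part (the finite-population form n·p·(1−p)·(N−n)/N for the count of low-cost books among the 670 labeled "sold"); it is chosen because it reproduces the σ ≈ 7.43 used above and makes Z² match the chi-square statistic exactly.

```python
import numpy as np

# Counts from the article's table
N, K, n = 1000, 500, 670   # total books, low-cost covers, books labeled sold
mu = n * K / N             # expected low-cost & sold count: 335.0

# Assumed spread of X under random labeling (see lead-in);
# chosen so that Z^2 equals the chi-square statistic
p = K / N
sigma = np.sqrt(n * p * (1 - p) * (N - n) / N)

z = (320 - mu) / sigma
print(round(z, 4))       # -2.0176
print(round(z ** 2, 4))  # 4.0706, the chi-square statistic
```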
If a standardized deviation follows a standard normal distribution, then squaring that random variable transforms the distribution into a chi-square distribution with one degree of freedom.

This is the curve obtained when we square a standard normal random variable Z. Since squaring removes the sign, both positive and negative values of Z map to positive values.
As a result, the symmetric bell-shaped distribution is transformed into a right-skewed distribution that follows a chi-square distribution with one degree of freedom.
When the degrees of freedom equal 1, we don't actually have to think in terms of squaring to make a decision.
There is only one independent deviation from independence, so we can standardize it and perform a two-sided Z-test.
Squaring simply turns that Z value into a chi-square value when df = 1. However, when the degrees of freedom are greater than 1, there are multiple independent deviations.
If we just add these deviations together, positive and negative values cancel out.
Squaring ensures that all deviations contribute positively to the total deviation.
That is why the chi-square statistic always sums squared standardized deviations, especially when df is greater than 1.
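The claim that a squared standard normal follows a chi-square distribution with df = 1 is easy to check by simulation. This sketch squares a million standard normal draws and compares the tail beyond 3.84 with the theoretical chi-square tail:

```python
import numpy as np
from scipy.stats import chi2

# Square a large sample of standard normal values; the result should
# behave like a chi-square distribution with 1 degree of freedom
rng = np.random.default_rng(0)
z_squared = rng.standard_normal(1_000_000) ** 2

# Compare tail areas beyond the df = 1 critical value 3.84
empirical = (z_squared > 3.84).mean()
theoretical = chi2.sf(3.84, df=1)
print(round(empirical, 3), round(theoretical, 3))  # both close to 0.05
```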
We now have a clearer understanding of how the normal distribution is linked to the chi-square distribution.
Now let's use this distribution to perform hypothesis testing.
Null Hypothesis (H₀)
The cover type and sales outcome are independent. (The cover type has no effect.)
Alternative Hypothesis (H₁)
The cover type and sales outcome are not independent. (The cover type is associated with whether a book is sold.)
A commonly used significance level is α = 0.05. This means we reject the null hypothesis only if our result falls within the most extreme 5% of outcomes under the null hypothesis.
From the chi-square distribution at df = 1 and α = 0.05, the critical value is 3.84.
The value 3.84 is the critical (cut-off) value. The area to the right of 3.84 equals 0.05, representing the rejection region.
Since our calculated chi-square statistic exceeds 3.84, it falls within this rejection region.

The p-value here is about 0.0436, which is the area to the right of 4.07.
This means that if cover type and sales were truly independent, there would be only about a 4.4% chance of observing a difference this large.
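Both numbers in this section come straight from the chi-square distribution; here is a short sketch using `scipy.stats.chi2`:

```python
from scipy.stats import chi2

# Critical value: the point with 5% of the df = 1 distribution to its right
critical_value = chi2.ppf(0.95, df=1)
print(round(critical_value, 2))  # 3.84

# p-value: the area to the right of our statistic 4.07
p_value = chi2.sf(4.07, df=1)
print(round(p_value, 4))  # 0.0436
```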
Now, whether these results are reliable depends on the assumptions of the chi-square test.
Let's look at the assumptions for this test:
1) Independence of observations
In this context, independence means that one book sale should not influence another. The same customer should not be counted multiple times, and observations should not be paired or repeated.
2) The data must be categorical counts.
3) Expected frequencies should not be too small
All expected cell counts should generally be at least 5.
4) Random sampling
The sample should represent the population.
Because all the assumptions are satisfied and the p-value (≈ 0.0436) is below 0.05, we reject the null hypothesis and conclude that cover type and sales are statistically related.
At this point, you might be confused about something.
We spent a lot of time focusing on one cell, for example the low-cost books that were sold.
We calculated its deviation, standardized it, and used that to understand how the chi-square statistic is formed.
But what about the other cells? What about high-cost books, or the unsold ones?
The important thing to realize is that in a 2×2 table, all four cells are linked. Once the row totals and column totals are fixed, the table has only one degree of freedom.
This means the counts cannot vary independently. If one cell increases, the other cells automatically adjust to keep the totals consistent.
As we discussed earlier, we can think of all possible tables with the same margins as points in a four-dimensional space.
However, because of the constraints imposed by the fixed totals, these points do not spread out in every direction. Instead, they lie along a single straight line.
Every deviation from independence moves the table only along that one direction.
So, when one cell deviates by, say, +15 from its expected value, the other cells are automatically determined by the structure of the table.
The whole table shifts together. The deviation is not just about one number; it represents the movement of the entire system.
When we compute the chi-square statistic, we subtract the expected count from the observed count for every cell and standardize each deviation.
But in a 2×2 table, these deviations are tied together. They move as one coordinated structure.
This means that examining one cell is enough to understand how far the entire table has moved away from independence, and also what the distribution looks like.
Learning never ends, and there is still much more to explore about the chi-square test.
I hope this article has given you a clear understanding of what the chi-square test actually does.
In another blog, we will discuss what happens when the assumptions are not met and why the chi-square test may fail in those situations.
There was a small pause in my time series series. I realized that several topics deserved more clarity and careful thinking, so I decided to slow down instead of pushing forward. I will return to it soon with explanations that feel more complete and intuitive.
If you enjoyed this article, you can find more of my writing on Medium and LinkedIn.
Thanks for reading!



