who has written a children’s book and released it into the market in two versions at the same time and at the same price. One version has a basic cover design, while the other has a high-quality cover design, which of course cost him more.
He then observes the sales for a certain period and gathers the data shown below.
| Cover type | Sold | Not Sold | Total |
| --- | --- | --- | --- |
| Low-cost cover | 320 | 180 | 500 |
| High-cost cover | 350 | 150 | 500 |
| Total | 670 | 330 | 1000 |
Now he comes to us and wants to know whether the cover design of his books has affected their sales.
From the sales data, we can observe that there are two categorical variables. The first is cover type, which is either high cost or low cost, and the second is sales outcome, which is either sold or not sold.
Now we want to know whether these two categorical variables are related or not.
We know that when we need to find a relationship between two categorical variables, we use the Chi-square test for independence.
In practice, we can use Python to apply the Chi-square test and obtain the chi-square statistic and p-value.
Code:
import numpy as np
from scipy.stats import chi2_contingency
# Observed counts
# Rows: low-cost cover, high-cost cover; Columns: sold, not sold
observed = np.array([
    [320, 180],
    [350, 150]
])
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print("Chi-square statistic:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Expected frequencies:\n", expected)
Result:
Chi-square statistic: 4.0706
p-value: 0.0436
Degrees of freedom: 1
Expected frequencies:
 [[335. 165.]
 [335. 165.]]
The chi-square statistic is 4.07 with a p-value of 0.043, which is below the 0.05 threshold. This suggests that cover type and sales outcome are statistically associated.
We have now obtained the p-value, but before treating it as a decision, we need to understand how we got this value and what the assumptions of this test are.
Understanding this can help us decide whether the result we obtained is reliable or not.
Now let’s try to understand what the Chi-Square test actually is.
We have this data.

By observing the data, we can say that sales for books with the high-cost cover are higher, so we may think that the cover worked.
However, in real life, counts fluctuate by chance. Even if the cover has no effect and customers pick books randomly, we can still get unequal values.
Randomness always creates some imbalance.
Now the question is, “Is this difference bigger than what randomness usually creates?”
Let’s see how the Chi-Square test answers that question.
We already have this formula to calculate the Chi-Square statistic.
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]
where:
χ² is the Chi-Square test statistic
i represents the row index
j represents the column index
Oᵢⱼ is the observed count in row i and column j
Eᵢⱼ is the expected count in row i and column j
First let’s focus on Expected Counts.
Before understanding what expected counts are, let’s state the hypothesis for our test.
Null Hypothesis (H₀)
The cover type and sales outcome are independent. (The cover type has no effect)
Alternative Hypothesis (H₁)
The cover type and sales outcome are not independent. (The cover type is associated with whether a book is sold.)
Now what do we mean by expected counts?
Let’s say the null hypothesis is true, which means the cover type has no effect on the sales of books.
Let’s go back to probabilities.
As we already know, the formula for simple probability is:
\[P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}\]
In our data, the overall probability of a book being sold is:
\[P(\text{Sold}) = \frac{\text{Number of books sold}}{\text{Total number of books}} = \frac{670}{1000} = 0.67\]
In probability, when we write P(A∣B), we mean the probability of event A given that event B has already occurred.
Under independence, cover type and sales are not related. This means the probability of being sold does not depend on the cover type:
\[
P(\text{Sold} \mid \text{Low-cost cover}) = P(\text{Sold} \mid \text{High-cost cover}) = P(\text{Sold}) = \frac{670}{1000} = 0.67
\]
Under independence, we have P (Sold | Low-cost Cover) = 0.67, which means 67% of low-cost cover books are expected to be sold.
Since we have 500 books with low-cost covers, we convert this probability into an expected number of sold books.
\[0.67 \times 500 = 335\]
This means we expect 335 low-cost cover books to be sold under independence.
Based on our data table, we can represent this as E11.
Similarly, the expected value for the high-cost cover and sold is also 335, which is represented by E21.
Now let’s calculate E12 – Low-cost cover, Not Sold and E22 – High-cost cover, Not Sold.
The overall probability of a book not being sold is:
\[P(\text{Not Sold}) = \frac{330}{1000} = 0.33\]
Under independence, this probability applies to each subgroup, as before.
\[P(\text{Not Sold} \mid \text{Low-cost cover}) = 0.33\]
\[P(\text{Not Sold} \mid \text{High-cost cover}) = 0.33\]
Now we convert this probability into the expected count of unsold books.
\[E_{12} = 0.33 \times 500 = 165\]
\[E_{22} = 0.33 \times 500 = 165\]
We used probabilities here to understand the idea of expected counts, but we already have direct formulas to calculate them. Let’s also take a look at those.
Formula to calculate Expected Counts:
\[E_{ij} = \frac{R_i \times C_j}{N}\]
Where:
- Rᵢ = total of row i
- Cⱼ = total of column j
- N = grand total
Low-cost cover, Sold:
\[E_{11} = \frac{500 \times 670}{1000} = 335\]
Low-cost cover, Not Sold:
\[E_{12} = \frac{500 \times 330}{1000} = 165\]
High-cost cover, Sold:
\[E_{21} = \frac{500 \times 670}{1000} = 335\]
High-cost cover, Not Sold:
\[E_{22} = \frac{500 \times 330}{1000} = 165\]
In both ways, we get the same values.
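If you want to check this in code, here is a minimal NumPy sketch of the same calculation; it reproduces the expected array that chi2_contingency returned at the start.
import numpy as np

observed = np.array([
    [320, 180],
    [350, 150]
])

row_totals = observed.sum(axis=1)    # [500, 500]
col_totals = observed.sum(axis=0)    # [670, 330]
grand_total = observed.sum()         # 1000

# E_ij = (row total * column total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
# [[335. 165.]
#  [335. 165.]]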
By calculating expected counts, what we are finding is this: the counts we would see, on average, if the null hypothesis were true and the two categorical variables were independent.
Here, we have 1,000 books and we know that 670 are sold. Now we imagine randomly picking books and labeling them as sold.
After selecting 670 books, we check how many of them belong to the low-cost cover group and how many belong to the high-cost cover group.
If we repeat this process many times, we would obtain values around 335. Sometimes they may be 330 or 340.
We then consider the average, and 335 becomes the central point of the distribution if everything happens purely due to randomness.
This does not mean the count must equal 335, but that 335 represents the natural center of variation under independence.
The Chi-Square test then measures how far the observed count deviates from this central value relative to the variation expected under randomness.
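To see this thought experiment in action, here is a small simulation sketch (the variable names are illustrative): shuffle the 1,000 books, label the first 670 as sold, count how many of those have low-cost covers, and repeat many times.
import numpy as np

rng = np.random.default_rng(0)

# 1,000 books: 500 low-cost covers (coded 1) and 500 high-cost covers (coded 0)
covers = np.array([1] * 500 + [0] * 500)

counts = []
for _ in range(10_000):
    rng.shuffle(covers)
    sold = covers[:670]          # randomly label 670 books as "sold"
    counts.append(sold.sum())    # low-cost books among the sold ones

counts = np.array(counts)
print(counts.mean())   # close to 335
The counts fluctuate from repetition to repetition, but their average sits right around the expected count of 335.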
We calculated the expected counts:
E11 = 335; E21 = 335; E12 = 165; E22 = 165

The next step is to calculate the deviation between the observed and expected counts. To do this, we subtract the expected count from the observed count.
\[
\begin{aligned}
\text{Low-Cost Cover \& Sold:} \quad & O - E = 320 - 335 = -15 \\[8pt]
\text{Low-Cost Cover \& Not Sold:} \quad & O - E = 180 - 165 = 15 \\[8pt]
\text{High-Cost Cover \& Sold:} \quad & O - E = 350 - 335 = 15 \\[8pt]
\text{High-Cost Cover \& Not Sold:} \quad & O - E = 150 - 165 = -15
\end{aligned}
\]
In the next step, we square the differences because if we add the raw deviations, the positive and negative values cancel out, resulting in zero.
This would incorrectly suggest that there is no imbalance. Squaring solves the cancellation problem by allowing us to measure the magnitude of the imbalance, regardless of direction.
\[
\begin{aligned}
\text{Low-Cost Cover \& Sold:} \quad & (O - E)^2 = (-15)^2 = 225 \\[6pt]
\text{Low-Cost Cover \& Not Sold:} \quad & (15)^2 = 225 \\[6pt]
\text{High-Cost Cover \& Sold:} \quad & (15)^2 = 225 \\[6pt]
\text{High-Cost Cover \& Not Sold:} \quad & (-15)^2 = 225
\end{aligned}
\]
Now that we have calculated the squared deviations for each cell, the next step is to divide them by their respective expected counts.
This standardizes the deviations by scaling them relative to what was expected under the null hypothesis.
\[
\begin{aligned}
\text{Low-Cost Cover \& Sold:} \quad & \frac{(O - E)^2}{E} = \frac{225}{335} = 0.6716 \\[6pt]
\text{Low-Cost Cover \& Not Sold:} \quad & \frac{225}{165} = 1.3636 \\[6pt]
\text{High-Cost Cover \& Sold:} \quad & \frac{225}{335} = 0.6716 \\[6pt]
\text{High-Cost Cover \& Not Sold:} \quad & \frac{225}{165} = 1.3636
\end{aligned}
\]
Now, for every cell, we have calculated:
\[
\frac{(O - E)^2}{E}
\]
Each of these values represents the standardized squared contribution of a cell to the total imbalance. Summing them gives the overall standardized squared deviation for the table, known as the Chi-Square statistic.
\[
\begin{aligned}
\chi^2 &= 0.6716 + 1.3636 + 0.6716 + 1.3636 \\[6pt]
&= 4.0704 \\[6pt]
&\approx 4.07
\end{aligned}
\]
We obtained a Chi-Square statistic of 4.07.
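The same step-by-step arithmetic can be written compactly in NumPy, as a quick check that the by-hand result matches what chi2_contingency reported earlier.
import numpy as np

observed = np.array([
    [320, 180],
    [350, 150]
])
expected = np.array([
    [335, 165],
    [335, 165]
])

deviations = observed - expected              # O - E for every cell
contributions = deviations ** 2 / expected    # (O - E)^2 / E for every cell
chi_square = contributions.sum()

print(contributions.round(4))   # [[0.6716 1.3636]
                                #  [0.6716 1.3636]]
print(round(chi_square, 2))     # 4.07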
How can we interpret this value?
After calculating the chi-square statistic, we compare it with the critical value from the chi-square distribution table for 1 degree of freedom at a significance level of 0.05.
For df = 1 and α = 0.05, the critical value is 3.84. Since our calculated value (4.07) is greater than 3.84, we reject the null hypothesis.
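We do not have to look this cut-off up in a printed table; SciPy can produce the same value. A minimal sketch:
from scipy.stats import chi2

# Critical value: the point with 5% of the chi-square(df=1) distribution to its right
critical_value = chi2.ppf(0.95, df=1)
print(round(critical_value, 2))   # 3.84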
The chi-square test is complete at this point, but we still need to understand what df = 1 means and how the critical value of 3.84 is obtained.
This is where things start to get both interesting and slightly confusing.
First, let’s understand what df = 1 means.
‘df’ means Degrees of Freedom.
From our data,

We can call this a contingency table, and to be specific, a 2×2 contingency table: the categories of the first variable form the rows and the categories of the second variable form the columns. Here we have 2 rows and 2 columns.
We can observe that the row totals and column totals are fixed. This means that if one cell value changes, the other three must adjust accordingly to preserve those totals.
In other words, there is only one independent way the table can vary while keeping the row and column totals fixed. Therefore, the table has 1 degree of freedom.
We can also compute the degrees of freedom using the standard formula for a contingency table:
\[
df = (r - 1)(c - 1)
\]
where r is the number of rows and c is the number of columns.
In our example, we have a 2×2 table, so:
\[
df = (2 - 1)(2 - 1) = 1
\]
We now have an idea of what degrees of freedom mean from the data table. But why do we need to calculate them?
Now, let’s imagine a four-dimensional space in which each axis corresponds to one cell of the contingency table:
Axis 1: Low-cost & Sold
Axis 2: Low-cost & Not Sold
Axis 3: High-cost & Sold
Axis 4: High-cost & Not Sold
From the data table, we have the observed counts (320, 180, 350, 150). We also calculated the expected counts under independence as (335, 165, 335, 165).
Both the observed and expected counts can be represented as points in a four-dimensional space.
Now we have two points in a four-dimensional space.
We already calculated the difference between observed and expected counts (-15, 15, 15, -15).
We can write this as −15 × (1, −1, −1, 1).
In the observed data,

Let’s say we increase the Low-cost & Sold count from 320 to 321 (a +1 change).
To keep the row and column totals fixed, Low-cost & Not Sold must decrease by 1, High-cost & Sold must decrease by 1, and High-cost & Not Sold must increase by 1.
This produces the pattern (1, −1, −1, 1).
Any valid change in a 2×2 table with fixed margins follows this same pattern multiplied by some scalar.
Under fixed row and column totals, many different 2×2 tables are possible. When we represent each table as a point in 4-dimensional space, these tables lie on a one-dimensional straight line.
We can refer to the expected counts, (335, 165, 335, 165), as the center of that straight line and let’s denote that point as E.
The point E lies at the center of the line because, under pure randomness (independence), these are the values we expect to observe.
We then measure how much the observed counts deviate from these expected counts.
We can observe that every point on the line is:
E + x (1, −1, −1, 1)
where x is any scalar.
From our observed data table, we can write it as:
O = E + (-15) (1, −1, −1, 1)
Similarly, every point can be written like this.
The vector (1, −1, −1, 1) defines the direction of the one-dimensional deviation space, so we call it a direction vector. The scalar value just tells us how far to move in that direction.
Every valid table is obtained by starting at the expected table and moving some distance along this direction.
For example, any point on the line is (335+x, 165-x, 335-x, 165+x).
Substituting x=−15, the values become
(335−15, 165+15, 335+15, 165−15),
which simplifies to (320, 180, 350, 150).
This matches our observed table.
We can imagine that as x changes, the table moves only in one direction along a straight line.
This means that the entire deviation from independence is controlled by a single scalar value, which moves the table along a straight line.
Since all tables lie along a one-dimensional line, the system has only one independent direction of movement. This is why the degrees of freedom equal 1.
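We can verify this picture with a few lines of NumPy. In the sketch below, E is the expected table written as a point in four dimensions, d is the direction vector, moving x = −15 along d reproduces the observed table, and any choice of x leaves the row and column totals untouched.
import numpy as np

E = np.array([335, 165, 335, 165])   # expected counts as a point in 4D
d = np.array([1, -1, -1, 1])         # the single direction of allowed movement

O = E + (-15) * d                    # move x = -15 along the direction vector
print(O)                             # [320 180 350 150], the observed table

# Any x keeps the margins fixed: reshape to 2x2 and check the totals
for x in (-15, 0, 10):
    table = (E + x * d).reshape(2, 2)
    print(table.sum(axis=1), table.sum(axis=0))   # always [500 500] and [670 330]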
At this point, we know how to compute the chi-square statistic. As derived earlier, standardizing the deviation from the expected count and squaring it results in a chi-square value of 4.07.
Now that we understand what degrees of freedom mean, let’s explore what the chi-square distribution actually is.
Coming back to our observed data, we have 1000 books in total. Out of these, 670 were sold and 330 were not sold.
Under the assumption of independence (i.e., cover type does not influence whether a book is sold), we can imagine randomly selecting 670 books out of 1000 and labeling them as “sold.”
We then count how many of these selected books have a low-cost cover type. Let this count be denoted by X.
If we repeat this experiment many times as discussed earlier, each repetition would produce a different value of X, such as 321, 322, 326 and so on.
Now, if we plot these values across many repetitions, we can observe that they cluster around 335, forming a bell-shaped curve.
Plot:

We can observe that the distribution is approximately normal.
From our observed data table, the number of Low-cost and Sold books is 320. The distribution shown above represents how values behave under independence.
We see that values like 334 and 336 are common, while 330 and 340 are somewhat less common. A value like 320 appears to be relatively rare.
But how do we determine this correctly? To answer that, we must compare 320 to the center of the distribution, which is 335, and consider how wide the curve is.
The width of the curve reflects how much natural variation we expect under independence. Based on this spread, we can assess how frequently a value like 320 would occur.
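Where does this spread come from? The random-labeling experiment above is exactly a hypergeometric draw (670 "sold" labels spread over 500 low-cost and 500 high-cost books), so we can sample it directly and look at its standard deviation. A small sketch:
import numpy as np

rng = np.random.default_rng(0)

# X = number of low-cost books among 670 randomly chosen "sold" books,
# out of 500 low-cost and 500 high-cost books in total
X = rng.hypergeometric(ngood=500, nbad=500, nsample=670, size=100_000)

print(round(X.mean(), 1))   # about 335
print(round(X.std(), 1))    # about 7.4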
To assess this, we need to perform standardization.
Expected value: \( \mu = 335 \)
Observed value: \( X = 320 \)
Difference: \( 320 - 335 = -15 \)
Standard deviation of X under independence: \( \sigma \approx 7.435 \) (roughly the spread we saw in the sketch above)
\[
Z = \frac{320 - 335}{7.435} \approx -2.018
\]
So, 320 is about two standard deviations below the average.
We have just calculated the Z-score of 320, which is approximately −2.018.
In the same way, if we standardize each possible value of X, the sampling distribution of X above is transformed into the standard normal distribution with mean 0 and standard deviation 1.

Now we already know that 320 is about two standard deviations below the average:
Z-score ≈ −2.018
We already computed a chi-square statistic equal to 4.07.
Now let’s square the Z-score:
\( Z^2 = (-2.018)^2 \approx 4.07 \), which matches our chi-square statistic.
If a standardized deviation follows a standard normal distribution, then squaring that random variable transforms the distribution into a chi-square distribution with one degree of freedom.

This is the curve obtained when we square a standard normal random variable Z. Since squaring removes the sign, both positive and negative values of Z map to positive values.
As a result, the symmetric bell-shaped distribution is transformed into a right-skewed distribution that follows a chi-square distribution with one degree of freedom.
When there is only 1 degree of freedom, we actually do not need to think in terms of squaring to make a decision.
There is only one independent deviation from independence, so we can standardize it and perform a two-sided Z-test.
Squaring simply turns that Z value into a chi-square value, when df = 1. However, when the degrees of freedom are greater than 1, there are multiple independent deviations.
If we just add those deviations together, positive and negative values cancel out.
Squaring ensures that all deviations contribute positively to the total deviation.
That is why the chi-square statistic always sums squared standardized deviations, especially when df is greater than 1.
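This equivalence is easy to check with SciPy: at df = 1, the two-sided p-value of the Z-score and the chi-square p-value of its square agree. A small sketch, reusing the approximate Z-score from above:
from scipy.stats import norm, chi2

z = -2.018  # the standardized deviation computed earlier (approximate)

p_two_sided_z = 2 * norm.sf(abs(z))   # two-sided Z-test p-value
p_chi_square = chi2.sf(z ** 2, df=1)  # chi-square(1) p-value of Z squared

print(round(p_two_sided_z, 4))   # about 0.044
print(round(p_chi_square, 4))    # the same value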
We now have a clearer understanding of how the normal distribution is linked to the chi-square distribution.
Now let’s use this distribution to perform hypothesis testing.
Null Hypothesis (H₀)
The cover type and sales outcome are independent. (The cover type has no effect)
Alternative Hypothesis (H₁)
The cover type and sales outcome are not independent. (The cover type is associated with whether a book is sold.)
A commonly used significance level is α = 0.05. This means we reject the null hypothesis only if our result falls within the most extreme 5% of outcomes under the null hypothesis.
From the Chi-Square distribution at df = 1 and α = 0.05: the critical value is 3.84.
The value 3.84 is the critical (cut-off) value. The area to the right of 3.84 equals 0.05, representing the rejection region.
Since our calculated chi-square statistic exceeds 3.84, it falls within this rejection region.

The p-value here is 0.043, which is the area to the right of 4.07.
This means if cover type and sales were truly independent, there would be only a 4.3% chance of observing a difference this large.
Now whether these results are reliable or not depends on the assumptions of the chi-square test.
Let’s look at the assumptions for this test:
1) Independence of Observations
In this context, independence means that one book sale should not influence another. The same customer should not be counted multiple times, and observations should not be paired or repeated.
2) Data must be Categorical counts.
3) Expected Frequencies Should Not Be Too Small
All expected cell counts should generally be at least 5 (a quick way to check this is shown after this list).
4) Random Sampling
The sample should represent the population.
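For the third assumption, the expected array returned by chi2_contingency (the same one we printed at the very beginning) can be checked directly. A quick sketch:
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [320, 180],
    [350, 150]
])

_, _, _, expected = chi2_contingency(observed, correction=False)

# Assumption 3: every expected cell count should be at least 5
print(expected.min())          # 165.0 here, so the assumption holds
print((expected >= 5).all())   # True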
Because all the assumptions are satisfied and the p-value (0.043) is below 0.05, we reject the null hypothesis and conclude that cover type and sales are statistically associated.
At this point, you might be confused about something.
We spent a lot of time focusing on one cell, for example the low-cost books that were sold.
We calculated its deviation, standardized it, and used that to understand how the chi square statistic is formed.
But what about the other cells? What about high-cost books or the unsold ones?
The important thing to realize is that in a 2×2 table, all four cells are connected. Once the row totals and column totals are fixed, the table has only one degree of freedom.
This means the counts cannot vary independently. If one cell increases, the other cells automatically adjust to keep the totals consistent.
As we discussed earlier, we can think of all possible tables with the same margins as points in a four-dimensional space.
However, because of the constraints imposed by the fixed totals, those points do not spread out in every direction. Instead, they lie along the single straight line we discussed earlier.
Every deviation from independence moves the table only along that one direction.
So, when one cell deviates by, say, +15 from its expected value, the other cells are automatically determined by the structure of the table.
The whole table shifts together. The deviation is not just about one number. It represents the movement of the entire system.
When we compute the chi square statistic, we subtract observed from expected for all cells and standardize each deviation.
But in a 2×2 table, those deviations are tied together. They move as one coordinated structure.
This means that examining one cell is enough to understand how far the entire table has moved away from independence, and enough to reason about its sampling distribution.
Learning never ends, and there is still much more to explore about the chi-square test.
I hope this article has given you a clear understanding of what the chi-square test actually does.
In another blog, we will discuss what happens when the assumptions are not met and why the chi-square test may fail in those situations.
There has been a small pause in my time series series. I realized that a few topics deserved more clarity and careful thinking, so I decided to slow down instead of pushing forward. I will return to it soon with explanations that feel more complete and intuitive.
If you enjoyed this article, you can explore more of my writing on Medium and LinkedIn.
Thanks for reading!