What is a Chi-square test and its use in Data Science

Chi-square test

The two by two or fourfold contingency table represents two classifications of a set of counts or frequencies. The rows represent two classifications of one variable (e.g. outcome positive/outcome negative) and the columns represent two classifications of another variable (e.g. intervention/no intervention). These classifications must be independent. Paired results (e.g. outcomes for the same group of individuals before and after intervention) should be analysed using a test for matched pairs. The chi-square statistic calculated from the table tests the independence between the two classifications.

Assumptions of the tests of independence:

  • the sample is random
  • each observation may be classified into one cell (in the table) only


The chi-square statistic is:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where, for r rows and c columns of n observations, O is an observed frequency and E is an estimated expected frequency. The expected frequency for any cell is estimated as the row total times the column total, divided by the grand total (n).
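As a minimal sketch of the calculation described above, the following Python function estimates the expected frequencies and the uncorrected chi-square statistic for a table supplied as a list of rows. The function name `chi_square` and the list-of-lists layout are illustrative choices, not part of StatsDirect. The counts are those of the Armitage and Berry mortality example used later in this document.

```python
def chi_square(table):
    """Return (expected, statistic) for an r by c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency: row total times column total, divided by n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum (O - E)^2 / E over every cell of the table.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return expected, statistic


observed = [[41, 216], [64, 180]]
expected, chi2 = chi_square(observed)
print(expected)
print(round(chi2, 6))
```

For these counts the statistic agrees with the uncorrected Chi² reported in the worked example (7.978869), and the expected values match those listed there.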


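Yates’ continuity correction improves the approximation of the discrete sample chi-square statistic to a continuous chi-square distribution (Armitage and Berry, 1994):

$$\chi^2_{\text{Yates}} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}$$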
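The correction can be sketched in Python as below. This is an illustration under stated assumptions rather than StatsDirect's own code: the function names are hypothetical, the correction is applied as a fixed 0.5 to every cell, and the P value uses the identity that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(x / 2)).

```python
import math


def yates_chi_square_2x2(a, b, c, d):
    """Yates-corrected chi-square for the 2 by 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / n
            # Shrink each |O - E| by 0.5 before squaring.
            total += (abs(observed[i][j] - e) - 0.5) ** 2 / e
    return total


def p_value_1df(statistic):
    """Upper-tail P for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


corrected = yates_chi_square_2x2(41, 216, 64, 180)
p_corrected = p_value_1df(corrected)
print(round(corrected, 6), round(p_corrected, 4))
```

With the mortality counts from the worked example this agrees with the Yates-corrected Chi² and P reported there. Some implementations limit the correction when |O - E| is smaller than 0.5; this sketch does not.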


The r by c chi-square function can be used to examine two by two tables in greater detail.


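Pearson’s and Cramér’s (V) coefficients of contingency reflect the strength of the association in a contingency table (Agresti, 1996; Fleiss, 1981; Stuart and Ord, 1994):

$$C = \sqrt{\frac{\chi^2}{\chi^2 + n}}, \qquad V = \sqrt{\frac{\chi^2}{n \, \min(r - 1,\; c - 1)}}$$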


You will see another coefficient, phi (ϕ, correlation), given by the r by c chi-square function. This is equal to an unsigned version of V with 2 by 2 tables.
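A short Python sketch of these two coefficients for a 2 by 2 table follows. It assumes the standard definitions C = sqrt(chi2 / (chi2 + n)) and V = sqrt(chi2 / (n * min(r - 1, c - 1))), computed from the uncorrected chi-square, and it assumes that the sign of the signed V follows the cross-product difference ad - bc. The function name is hypothetical.

```python
import math


def association_2x2(a, b, c, d):
    """Pearson's contingency coefficient and signed Cramér's V for [[a, b], [c, d]]."""
    n = a + b + c + d
    cross = a * d - b * c
    # Shortcut form of the uncorrected chi-square for a 2 by 2 table.
    chi2 = n * cross ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    pearson_c = math.sqrt(chi2 / (chi2 + n))
    # For a 2 by 2 table min(r - 1, c - 1) = 1, so V = sqrt(chi2 / n).
    v = math.sqrt(chi2 / n)
    # Assumption: the sign is taken from ad - bc.
    signed_v = math.copysign(v, cross)
    return pearson_c, signed_v


pearson_c, signed_v = association_2x2(41, 216, 64, 180)
print(round(pearson_c, 6), round(signed_v, 6))
```

For the mortality example these agree with the measures of association reported in the worked output (0.125205 and -0.126198).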


Fisher’s exact test should be used as an alternative to the fourfold chi-square test if the total number of observations is less than twenty or any of the expected frequencies are less than five. In practical terms, however, there is little point in using the fourfold chi-square for testing independence when StatsDirect provides a Fisher’s exact test that can cope with large numbers.
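The rule of thumb in this paragraph can be written as a small check. The helper name `prefer_fisher_exact` is hypothetical, and the sketch assumes every row and column total is greater than zero.

```python
def prefer_fisher_exact(a, b, c, d):
    """True if the fourfold table [[a, b], [c, d]] is too sparse for chi-square."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    expected = [r * col / n for r in row_totals for col in col_totals]
    # Fewer than twenty observations, or any expected frequency below five.
    return n < 20 or min(expected) < 5


print(prefer_fisher_exact(41, 216, 64, 180))
print(prefer_fisher_exact(3, 1, 1, 3))
```

The first call uses the mortality counts from the worked example, where the smallest expected frequency is about 51, so the chi-square approximation is acceptable. The second table has only eight observations.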


If you specify that your results are from a case-control study then StatsDirect adds an odds ratio analysis. Let a, b, c and d be the observed frequencies of the standard fourfold table, with a and b in the top row (outcome positive) and c and d in the bottom row (outcome negative).


Odds ratio (OR) is related to risk ratio (RR, relative risk):

RR = (a / (a+c)) / (b / (b+d))


When a is small in comparison to c and b is small in comparison to d (i.e. relatively small numbers of outcome positive observations or low prevalence) then c can be substituted for a+c and d can be substituted for b+d in the above. With a little rearrangement this gives the odds ratio (cross ratio, approximate relative risk):

OR = (a*d)/(b*c).
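The approximation can be checked numerically with a short Python sketch. The function names are illustrative, and the counts in `rare` are made-up numbers chosen so that the outcome is uncommon; they do not come from the source.

```python
def risk_ratio(a, b, c, d):
    """RR = (a / (a + c)) / (b / (b + d)), as defined above."""
    return (a / (a + c)) / (b / (b + d))


def odds_ratio(a, b, c, d):
    """OR = (a * d) / (b * c), the cross ratio."""
    return (a * d) / (b * c)


# Hypothetical table with a rare outcome: a and b are small next to c and d.
rare = (10, 5, 990, 995)
rr_rare = risk_ratio(*rare)
or_rare = odds_ratio(*rare)
print(rr_rare, round(or_rare, 4))

# Mortality example from this document (a=41, b=216, c=64, d=180).
or_example = odds_ratio(41, 216, 64, 180)
print(round(or_example, 6))
```

With the rare-outcome counts the two measures differ by about half a percent, which is the sense in which the odds ratio approximates the relative risk. The last line agrees with the odds ratio reported in the worked example.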

A confidence interval (CI) for the odds ratio is calculated using two different methods. The Woolf (logit) method for large samples is given first, followed by an exact conditional maximum likelihood method (Fleiss, 1979; Gardner and Altman, 1989; Martin and Austin, 1991). Please note that the exact calculations may take an appreciable amount of time with large numbers.
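The Woolf (logit) interval can be sketched as follows. This assumes the usual form of the method: the standard error of ln(OR) is sqrt(1/a + 1/b + 1/c + 1/d) and the limits are exp(ln(OR) ± z × SE), with no continuity adjustment and all four cells greater than zero. The function name `woolf_ci` is hypothetical. The exact conditional method is not shown.

```python
import math
from statistics import NormalDist


def woolf_ci(a, b, c, d, level=0.95):
    """Approximate (logit) confidence interval for the odds ratio ad / bc."""
    log_or = math.log((a * d) / (b * c))
    # Large-sample standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Two-sided normal quantile, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


lower, upper = woolf_ci(41, 216, 64, 180)
print(round(lower, 6), round(upper, 6))
```

For the mortality counts used in the worked example this agrees with the approximate 95% confidence interval reported there.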

If you specify that your results are from a cohort study then StatsDirect adds a relative risk analysis. See risk (prospective) for more details.


Observed frequencies should be entered as a standard fourfold table:

                    feature present    feature absent
outcome positive:   a                  b
outcome negative:   c                  d


From Armitage and Berry (1994).

The following represent mortality data for two groups of patients receiving different treatments, A and B.

Treatment / Exposure    Dead    Alive
A                       41      216
B                       64      180

To analyse these data in StatsDirect you must select the 2 by 2 contingency table from the chi-square section of the analysis menu. Select the default 95% confidence interval. Enter the frequencies into the contingency table on screen. Note that the input screen has outcome values from top to bottom and the other classifier (e.g. treatment) from left to right; some books and papers show these the other way around.

For this example:

Observed values and totals:

41 216 257
64 180 244
105 396 501

Expected values:

53.862275 203.137725
51.137725 192.862275


Uncorrected Chi² = 7.978869 P = 0.0047

Yates-corrected Chi² = 7.370595 P = 0.0066


Measures of association:

Pearson’s contingency = 0.125205

Cramér’s V (signed) = -0.126198


Odds ratio analysis


Odds Ratio = 0.533854


Using the Woolf (logit) approximation:

Approximate 95% confidence interval = 0.344118 to 0.828206


Using conditional likelihood estimation:

Fisher exact 95% confidence interval = 0.334788 to 0.846292

Exact Fisher one sided P = 0.0033, two sided P = 0.0059

mid-P exact 95% confidence interval = 0.342636 to 0.828071

Exact mid-P one sided P = 0.0024, two sided P = 0.0049
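The Fisher exact P values can be approximated independently by enumerating the hypergeometric distribution of the top-left cell with the margins held fixed. The sketch below rests on assumptions: the one-sided P is taken as the lower tail (appropriate here because the observed count of 41 is below its expected value), and the two-sided P sums every table that is no more probable than the observed one. StatsDirect may define the two-sided value differently, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Return (lower-tail one-sided P, two-sided P) for the table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    k_min, k_max = max(0, c1 - r2), min(r1, c1)
    # Integer weights proportional to the hypergeometric probabilities,
    # kept as exact integers so comparisons are not affected by rounding.
    weights = {k: comb(r1, k) * comb(r2, c1 - k) for k in range(k_min, k_max + 1)}
    total = sum(weights.values())
    lower_tail = sum(w for k, w in weights.items() if k <= a)
    # Two-sided: every table no more probable than the observed one.
    two_sided = sum(w for w in weights.values() if w <= weights[a])
    return lower_tail / total, two_sided / total


p_one, p_two = fisher_exact_2x2(41, 216, 64, 180)
print(round(p_one, 4), round(p_two, 4))
```

The results can be compared with the exact Fisher one sided and two sided P values listed above.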


Here we can see a statistically significant relationship between treatment and mortality. The strength of that relationship is reflected by the coefficient of contingency. The odds ratio tells us that the odds of dying after treatment A are about half the odds of dying after treatment B. With 95% confidence we put the true population value for this ratio of odds somewhere between 0.33 and 0.85. If you need to phrase the argument with odds ratios the other way around then just quote the reciprocals, i.e. here we would say that the odds of dying after treatment B are about 1.9 times the odds of dying after treatment A.