Chapter 12 Goodness of Fit

Learning Objectives

  1. Use a chi-square test to test the hypothesis that a random sample is from a particular distribution, including cases where parameters are unknown.
  2. Explain what is meant by a contingency (or two-way) table, and use a chi-square test to test the independence of two classification criteria.

Theory

Chapter Summary: Goodness of Fit (Chi-square Tests)

This chapter focuses on using chi-square (\(\chi^2\)) tests to assess if observed data aligns with a theoretical distribution or if two categorical variables are independent.

12.0.0.1 1. Chi-square Test for Goodness-of-Fit

This test assesses whether observed frequencies from a random sample conform to a specified distribution (e.g., Binomial, Poisson, or a fair die).

Learning Objective: Use a chi-square test to test the hypothesis that a random sample is from a particular distribution, including cases where parameters are unknown.

Key Steps: * Hypotheses: * Null Hypothesis (\(H_0\)): The observed frequencies conform to the specified distribution. * Alternative Hypothesis (\(H_1\)): The observed frequencies do not conform to the specified distribution. * Parameter Estimation (if unknown): Calculate estimates for any unknown parameters of the theoretical distribution (e.g., \(\lambda\) for Poisson, \(p\) for Binomial) using Maximum Likelihood Estimation (MLE) from the sample data. * Expected Frequencies (\(E_i\)): Use the estimated parameters to calculate the expected frequency for each category or cell under the null hypothesis. This is done by multiplying the total number of observations by the probability for each category. * Test Statistic: Calculate the \(\chi^2\) test statistic using the formula: \(\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\). * \(O_i\) represents the observed frequency in category \(i\). * \(E_i\) represents the expected frequency in category \(i\). * Degrees of Freedom (dof): The degrees of freedom for this test are calculated as \(n - k - 1\), where: * \(n\) is the number of categories/cells. * \(k\) is the number of parameters estimated from the data. * Decision: * The test is an upper one-sided test; a larger test statistic indicates a poorer fit. * Compare the calculated \(\chi^2\) statistic to the critical value from the \(\chi^2\) distribution for the determined dof and chosen significance level. * If the test statistic exceeds the critical value, reject \(H_0\). This suggests the data does not fit the specified distribution. * Reliability: The test becomes unreliable if any expected cell frequency is too small. A common guideline is that all expected frequencies should be greater than 1, and no more than 20% of cells should have expected frequencies under 5. Cells may need to be combined to meet these conditions. * R Implementation: The chisq.test() function can be used. Note that for cases requiring parameter estimation, manual adjustment of degrees of freedom is needed as the built-in function doesn’t automatically do this.

12.0.0.2 2. Chi-square Test for Independence (Contingency Tables)

This test determines if there is an association between two categorical variables, often presented in a contingency (or two-way) table.

Learning Objective: Explain what is meant by a contingency (or two-way) table, and use a chi-square test to test the independence of two classification criteria.

Key Concepts: * Contingency Table: A table that displays the observed frequencies of two or more categorical variables, with rows representing one variable’s categories and columns representing another’s. * For example, classifying accident victims by seatbelt use and injury severity. * Hypotheses: * Null Hypothesis (\(H_0\)): The two classification criteria (variables) are independent. * Alternative Hypothesis (\(H_1\)): The two classification criteria are not independent (i.e., there is an association). * Expected Frequencies (\(E_{ij}\)): For each cell \((i, j)\) in the table, the expected frequency under the assumption of independence is calculated as: \(E_{ij} = \frac{(\text{Row } i \text{ Total}) \times (\text{Column } j \text{ Total})}{\text{Grand Total}}\). * Test Statistic: The \(\chi^2\) test statistic is calculated using the same formula as the goodness-of-fit test: \(\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\). * \(O_{ij}\) is the observed frequency in cell \((i, j)\). * Degrees of Freedom (dof): For a contingency table with \(R\) rows and \(C\) columns, the degrees of freedom are \((R-1)(C-1)\). * Decision: * Similar to goodness-of-fit, this is an upper one-sided test. * Compare the calculated \(\chi^2\) statistic to the critical value from the \(\chi^2\) distribution for the determined dof and chosen significance level. * If the test statistic exceeds the critical value, reject \(H_0\), concluding that there is evidence of an association between the variables. * Special Considerations: * Yates’ Continuity Correction: For 2x2 contingency tables, Yates’ continuity correction is often applied to the \(\chi^2\) test statistic. This can be turned off in R by adding correct=FALSE to chisq.test() to match Paper A expectations. * Fisher’s Exact Test: For 2x2 tables, especially with small expected frequencies, Fisher’s exact test is a recommended alternative. It determines the sampling distribution precisely using permutations. * R Implementation: Use chisq.test() after structuring your observed values into a matrix using matrix().

R Practice