Inference Test for Two Proportions

Arvind V.

2022-11-10

Setting up R packages

library(tidyverse)
library(mosaic)
library(vcd)
library(visStatistics) # One package to test them all

### Dataset from Chihara and Hesterberg's book (Second Edition)
library(resampledata)

Plot Fonts and Theme

Code
library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    # theme(panel.widths = unit(11, "cm"),
    #       panel.heights = unit(6.79, "cm")) + # Golden Ratio

    theme(
      plot.margin = margin_auto(t = 1, r = 2, b = 1, l = 1, unit = "cm"),
      plot.background = element_rect(
        fill = "bisque",
        colour = "black",
        linewidth = 1
      )
    ) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 10
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 8
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

Introduction

Many experiments gather qualitative data across different segments of a population, for example, opinion about a topic among people who belong to different income groups, or who live in different parts of a city. This should remind us of the Likert Plots that we plotted earlier. In this case the two variables, dependent and independent, are both Qualitative, and we can calculate counts and proportions.

How does one Qual variable affect the other? How do counts/proportions of the dependent variable vary with the levels of the independent variable? This is our task for this module.

Here is a quick example of the kind of data we might look at here, taken from the British Medical Journal:

Figure 1: Breast Feeding

Clearly, we can see differences in counts/proportions of women who breast-fed their babies for three months or more, based on whether they were “printers wives” or “farmers’ wives”!

Is there a doctor in the House?

The CLT for Two Proportions

We first need to establish some model assumptions prior to making our analysis. As before, we wish to see if the CLT applies here, and if so, in what form. The difference between two proportions \(\hat{p_1}-\hat{p_2}\) can be modeled using a normal distribution when:

  • Independence (extended): The data are independent within and between the two groups. Generally this is satisfied if the data come from two independent random samples or if the data come from a randomized experiment.
  • Success-failure condition: The success-failure condition holds for both groups, where we check successes and failures in each group separately. That is, we should have at least 10 successes and 10 failures in each of the two groups.

When these conditions are satisfied, the standard error of \(\hat{p_1}-\hat{p_2}\) is well-approximated by:

\[ SE(\hat{p_1}-\hat{p_2}) = \sqrt{\frac{\hat{p_1}*(1-\hat{p_1})}{n_1}} + \sqrt{\frac{\hat{p_2}*(1-\hat{p_2})}{n_2}} \tag{1}\]

where \(\hat{p_1}\) and \(\hat{p_2}\) represent the sample proportions, and \(n_1\) and \(n_2\) represent the sample sizes.

We can represent the Confidence Intervals as:

\[ \begin{eqnarray} CI(p_1 - p_2) &=& (\hat{p_1} - \hat{p_2}) \pm 1.96 * SE(\hat{p_1}-\hat{p_2})\\ &=& (\hat{p_1} - \hat{p_2}) \pm 1.96 * \left(\sqrt{\frac{\hat{p_1}*(1-\hat{p_1})}{n_1}} + \sqrt{\frac{\hat{p_2}*(1-\hat{p_2})}{n_2}}\right) \end{eqnarray} \tag{2}\]

Case Study-1: GSS2002 dataset

We saw how we could perform inference for a single proportion. We can extend this idea to multiple proportions too.

Let us try a dataset with Qualitative / Categorical data. This is the General Social Survey GSS dataset from the resampledata package, and we have people with different levels of Education stating their opinion on the Death Penalty. We want to know if these two Categorical variables have a correlation, i.e. can the opinions in favour of the Death Penalty be explained by the Education level?

Since data is Categorical ( both variables ), we need to take counts in a table, and then implement a chi-square test. In the test, we will permute the Education variable to see if we can see how significant its effect size is.

data(GSS2002, package = "resampledata")
glimpse(GSS2002)
Rows: 2,765
Columns: 21
$ ID            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ Region        <fct> South Central, South Central, South Central, South Centr…
$ Gender        <fct> Female, Male, Female, Female, Male, Male, Female, Female…
$ Race          <fct> White, White, White, White, White, White, White, White, …
$ Education     <fct> HS, Bachelors, HS, Left HS, Left HS, HS, Bachelors, HS, …
$ Marital       <fct> Divorced, Married, Separated, Divorced, Divorced, Divorc…
$ Religion      <fct> Inter-nondenominational, Protestant, Protestant, Protest…
$ Happy         <fct> Pretty happy, Pretty happy, NA, NA, NA, Pretty happy, NA…
$ Income        <fct> 30000-34999, 75000-89999, 35000-39999, 50000-59999, 4000…
$ PolParty      <fct> "Strong Rep", "Not Str Rep", "Strong Rep", "Ind, Near De…
$ Politics      <fct> Conservative, Conservative, NA, NA, NA, Conservative, NA…
$ Marijuana     <fct> NA, Not legal, NA, NA, NA, NA, NA, NA, Legal, NA, NA, NA…
$ DeathPenalty  <fct> Favor, Favor, NA, NA, NA, Favor, NA, NA, Favor, NA, NA, …
$ OwnGun        <fct> No, Yes, NA, NA, NA, Yes, NA, NA, Yes, NA, NA, NA, NA, N…
$ GunLaw        <fct> Favor, Oppose, NA, NA, NA, Oppose, NA, NA, Oppose, NA, N…
$ SpendMilitary <fct> Too little, About right, NA, About right, NA, Too little…
$ SpendEduc     <fct> Too little, Too little, NA, Too little, NA, Too little, …
$ SpendEnv      <fct> About right, About right, NA, Too little, NA, Too little…
$ SpendSci      <fct> About right, About right, NA, Too little, NA, Too little…
$ Pres00        <fct> Bush, Bush, Bush, NA, NA, Bush, Bush, Bush, Bush, NA, NA…
$ Postlife      <fct> Yes, Yes, NA, NA, NA, Yes, NA, NA, Yes, NA, NA, NA, NA, …


Note how all variables are Categorical !! Education has five levels, and of course DeathPenalty has three:

GSS2002 %>% count(Education)
GSS2002 %>% count(DeathPenalty)

Data Munging

Let us drop NA entries in Education and Death Penalty and also check the levels of the two factors and set them up as we need:

GSS2002_modified <- GSS2002 %>%
  dplyr::select(Education, DeathPenalty) %>%
  tidyr::drop_na(., c(Education, DeathPenalty)) %>%
  # Re-level the factors
  mutate(
    Education = factor(Education,
      levels = c("Left HS", "HS", "Jr Col", "Bachelors", "Graduate"),
      labels = c("Left HS", "HS", "Jr Col", "Bachelors", "Graduate"),
      # ordered = TRUE # Nope can't use this. See below
    ),
    DeathPenalty = factor(DeathPenalty,
      levels = c("Favor", "Oppose"),
      labels = c("Favor", "Oppose")
      # ordered = TRUE # Nope can't use this. See below
    )
  )
glimpse(GSS2002_modified)
Rows: 1,307
Columns: 2
$ Education    <fct> HS, Bachelors, HS, HS, HS, HS, HS, Jr Col, HS, Bachelors,…
$ DeathPenalty <fct> Favor, Favor, Favor, Favor, Favor, Favor, Favor, Favor, O…

We can now set up a Contingency Table.

Table 1: Contingency Table of Education vs Death Penalty
gss_vcd_table <- vcd::structable(Education ~ DeathPenalty, # cols ~ rows
  data = GSS2002_modified
)
gss_vcd_table %>% addmargins()
            Education
DeathPenalty Left HS  HS Jr Col Bachelors Graduate  Sum
      Favor      117 511     71       135       64  898
      Oppose      72 200     16        71       50  409
      Sum        189 711     87       206      114 1307

Contingency Table Plots

The Contingency Table can be plotted using a mosaic plot using two packages, as we have seen. Let us do a quick recap:

Hypotheses Definition

What would our Hypotheses be relating to the proportions of votes for or against the DeathPenalty?

\(H_0: \text{Education does not affect votes for Death Penalty}\\\)

\(H_a: \text{Education affects votes for Death Penalty}\\\)

Inference for Two Proportions

We are now ready to perform our statistical inference. We will use the standard Pearson chi-square test, and develop and intuition for it. We will then do a permutation test to have an alternative method to complete the same task.

Hence, after all this calculation, we have the \(X^2\) statistic, which we can compare with the critical value, and make our inference.

Permutation Test for Education

We will now perform the permutation test for the difference between proportions. We will first get an intuitive idea of the permutation, and then perform it using both mosaic and infer.

Inference for Proportions Case Study-2: TBD dataset

To be Written Up. Yes, but when, Arvind?

Conclusion

In our basic \(X^2\) test, we calculate the test statistic of \(X^2\) and look up a theoretical null distribution for that statistic, and see how unlikely our observed value is.

Why would a permutation test be a good idea here? With a permutation test, there are no assumptions of the null distribution: this is computed based on real data. We note in passing that, in this case, since the number of cases in each cell of the Contingency Table are fairly high ( >= 5) the resulting NULL distribution is of the \(X^2\) variety.

Wait, But Why?

The \(X^2\) test is a very powerful tool for testing the independence of two categorical variables. It is widely used in various fields, including social sciences, biology, and market research. The test is based on the comparison of observed frequencies in a contingency table with expected frequencies under the assumption of independence.

When performing research for a design project, peasants, you are likely to have only several Qual variables, and you will want to see if there is a relationship between them. The \(X^2\) test is a great way to do that. It allows you to test the independence of two categorical variables, which can help you understand the relationships between different factors in your design project.

Your Turn

  1. DaytonSurvey: Substance Abuse among highschoolers. Part of the vcdExtra package. ( Install it, peasants!). dataset: https://friendly.r-universe.dev/vcdExtra/doc/manual.html#DaytonSurvey

    • Is there a relationship between Gender and Substance use?
    • Is there a relationship between Race and Substance use?
  2. Titanic dataset: https://www.kaggle.com/c/titanic/data

    • Is there a relationship between Sex and Survived?
    • Is there a relationship between Pclass and Survived?
    • Is there a relationship between Embarked and Survived?

    Can you prove that Jack would have survived, had Rose been a nice-R person?

  3. Gilby dataset: https://friendly.r-universe.dev/vcdExtra/doc/manual.html#Gilby

    • Is there a relationship between Clothing and Dullness?
  4. Can you show that people who wear ethnic are more likely to be considered dull? Find a dataset to prove or disprove this hypothesis!

References

  1. OpenIntro Modern Statistics: Chapter 17
  2. Chapter 8: The Chi-Square Test, from Statistics at Square One. The British Medical Journal. https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/8-chi-squared-tests. Very readable and easy to grasp. Especially if you like watching Grey’s Anatomy and House.
  3. Exploring the underlying theory of the chi-square test through simulation - part 1 https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence/
  4. Exploring the underlying theory of the chi-square test through simulation - part 2 https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/
  5. An Online \(\Xi^2\)-test calculator. https://www.statology.org/chi-square-test-of-independence-calculator/
  6. https://saylordotorg.github.io/text_introductory-statistics/s13-04-comparison-of-two-population-p.html
  7. vcd Tutorial: http://euclid.psych.yorku.ca/datavis.ca/courses/VCD/vcd-tutorial.pdf
R Package Citations
Package Version Citation
ggmosaic 0.3.3 Jeppson, Hofmann, and Cook (2021)
resampledata 0.3.2 Chihara and Hesterberg (2018)
scales 1.4.0 Wickham, Pedersen, and Seidel (2025)
vcd 1.4.13 Meyer, Zeileis, and Hornik (2006); Zeileis, Meyer, and Hornik (2007); Meyer et al. (2024)
Chihara, Laura M., and Tim C. Hesterberg. 2018. Mathematical Statistics with Resampling and r. John Wiley & Sons Hoboken NJ. https://github.com/lchihara/MathStatsResamplingR?tab=readme-ov-file.
Jeppson, Haley, Heike Hofmann, and Di Cook. 2021. ggmosaic: Mosaic Plots in the ggplot2 Framework. https://doi.org/10.32614/CRAN.package.ggmosaic.
Meyer, David, Achim Zeileis, and Kurt Hornik. 2006. “The Strucplot Framework: Visualizing Multi-Way Contingency Tables with Vcd.” Journal of Statistical Software 17 (3): 1–48. https://doi.org/10.18637/jss.v017.i03.
Meyer, David, Achim Zeileis, Kurt Hornik, and Michael Friendly. 2024. vcd: Visualizing Categorical Data. https://doi.org/10.32614/CRAN.package.vcd.
Wickham, Hadley, Thomas Lin Pedersen, and Dana Seidel. 2025. scales: Scale Functions for Visualization. https://doi.org/10.32614/CRAN.package.scales.
Zeileis, Achim, David Meyer, and Kurt Hornik. 2007. “Residual-Based Shadings for Visualizing (Conditional) Independence.” Journal of Computational and Graphical Statistics 16 (3): 507–25. https://doi.org/10.1198/106186007X237856.