The Mad Hatter’s Guide to Data Viz and Stats in R

🃏 Permutation Test for Two Proportions

Test Proportions

Permutation
Monte Carlo Simulation
Random Number Generation
Distributions
Generating Parallel Worlds
Author

Arvind Venkatadri

Published

November 10, 2022

1 Setting up the Packages

library(tidyverse)
library(mosaic)
library(ggmosaic) # plotting mosaic plots for Categorical Data

### Dataset from Chihara and Hesterberg's book (Second Edition)
library(resampledata)
library(explore)

2 Introduction

We saw from the diagram created by Allen Downey that there is only one test! We will now use this philosophy to develop a technique that lets us mechanize several statistical tests in the same way, with nearly identical code.

We will use two R packages, mosaic and the relatively new infer package, to develop our intuition for what are called permutation-based statistical tests.

3 Testing for Two or More Proportions

Let us try a dataset with Qualitative / Categorical data. This is the GSS2002 dataset from the General Social Survey (GSS), in which people with different levels of Education state their opinion on the Death Penalty. We want to know whether these two Categorical variables are associated, i.e. can opinions in favour of the Death Penalty be explained by Education level?

Since both variables are Categorical, we need to tabulate counts and then run a chi-square test. In the permutation version of the test, we will shuffle the Education variable many times to see how unusual the observed statistic is when Education and opinion are unrelated.
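To see what shuffling does before we touch the real data, here is a minimal base R sketch on a tiny made-up sample (the ten observations below are hypothetical and are not taken from GSS2002). Shuffling one variable leaves the count of each level unchanged, but it breaks any link between the two variables, which is exactly the situation the null hypothesis describes.

```r
# Hypothetical toy sample: 10 people, two categorical variables
set.seed(1)
edu <- factor(rep(c("HS", "Graduate"), times = c(6, 4)))
opinion <- factor(rep(c("Favor", "Oppose", "Favor", "Oppose"),
                      times = c(5, 1, 1, 3)))

# One permutation: reorder the Education labels at random
shuffled_edu <- sample(edu)

# The margins (how many people at each level) are unchanged...
table(edu)
table(shuffled_edu)

# ...but the cross-tabulation with opinion generally changes
table(opinion, edu)
table(opinion, shuffled_edu)
```

Repeating this shuffle many times, and recomputing a test statistic each time, gives the null distribution against which the observed statistic is compared.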

data(GSS2002)
inspect(GSS2002)

categorical variables:  
            name  class levels    n missing
1         Region factor      7 2765       0
2         Gender factor      2 2765       0
3           Race factor      3 2765       0
4      Education factor      5 2760       5
5        Marital factor      5 2765       0
6       Religion factor     13 2746      19
7          Happy factor      3 1369    1396
8         Income factor     24 1875     890
9       PolParty factor      8 2729      36
10      Politics factor      7 1331    1434
11     Marijuana factor      2  851    1914
12  DeathPenalty factor      2 1308    1457
13        OwnGun factor      3  924    1841
14        GunLaw factor      2  916    1849
15 SpendMilitary factor      3 1324    1441
16     SpendEduc factor      3 1343    1422
17      SpendEnv factor      3 1322    1443
18      SpendSci factor      3 1266    1499
19        Pres00 factor      5 1749    1016
20      Postlife factor      2 1211    1554
                                    distribution
1  North Central (24.7%) ...                    
2  Female (55.6%), Male (44.4%)                 
3  White (79.1%), Black (14.8%) ...             
4  HS (53.8%), Bachelors (16.1%) ...            
5  Married (45.9%), Never Married (25.6%) ...   
6  Protestant (53.2%), Catholic (24.5%) ...     
7  Pretty happy (57.3%) ...                     
8  40000-49999 (9.1%) ...                       
9  Ind (19.3%), Not Str Dem (18.9%) ...         
10 Moderate (39.2%), Conservative (15.8%) ...   
11 Not legal (64%), Legal (36%)                 
12 Favor (68.7%), Oppose (31.3%)                
13 No (65.5%), Yes (33.5%) ...                  
14 Favor (80.5%), Oppose (19.5%)                
15 About right (46.5%) ...                      
16 Too little (73.9%) ...                       
17 Too little (60%) ...                         
18 About right (49.7%) ...                      
19 Bush (50.6%), Gore (44.7%) ...               
20 Yes (80.5%), No (19.5%)                      

quantitative variables:  
  name   class min  Q1 median   Q3  max mean       sd    n missing
1   ID integer   1 692   1383 2074 2765 1383 798.3311 2765       0

Note how all variables except ID are Categorical!! Let us count the levels of Education (five) and DeathPenalty (two):

GSS2002 %>% count(Education)
GSS2002 %>% count(DeathPenalty)

Let us drop the rows with NA entries in Education or DeathPenalty, and then set up a summary table of counts for the chi-square test.

gss2002 <- GSS2002 %>%
  dplyr::select(Education, DeathPenalty) %>%
  tidyr::drop_na(., c(Education, DeathPenalty))
dim(gss2002)
[1] 1307    2
gss_summary <- gss2002 %>%
  mutate(
    Education = factor(
      Education,
      levels = c("Bachelors", "Graduate", "Jr Col", "HS", "Left HS"),
      labels = c("Bachelors", "Graduate", "Jr Col", "HS", "Left HS")
    ),
    DeathPenalty = as.factor(DeathPenalty)
  ) %>%
  group_by(Education, DeathPenalty) %>%
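  summarise(count = n(), .groups = "drop_last") %>% # cell counts: good for a chisq test
  # Still grouped by Education here, so the sums below are per Education level.
  # Add two more columns to facilitate the mosaic/Marimekko plot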
  mutate(
    edu_count = sum(count),
    edu_prop = count / sum(count)
  ) %>%
  ungroup()

gss_summary
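As a cross-check on what the summary columns mean, here is a small base R sketch that recomputes them from the cell counts reported in the cross-tabulation in Section 4.3 of this page. The object names `counts`, `edu_count` and `edu_prop` are chosen here for illustration; `edu_count` is the number of respondents at each Education level and `edu_prop` is the share of each opinion within that level.

```r
# Cell counts as reported in the DeathPenalty ~ Education table on this page
counts <- matrix(
  c(117, 511, 71, 135, 64,
     72, 200, 16,  71, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    DeathPenalty = c("Favor", "Oppose"),
    Education = c("Left HS", "HS", "Jr Col", "Bachelors", "Graduate")
  )
)

# Respondents per Education level (used as bar widths in a Marimekko plot)
edu_count <- colSums(counts)
edu_count

# Proportion of each opinion within an Education level (each column sums to 1)
edu_prop <- prop.table(counts, margin = 2)
round(edu_prop, 3)
```

The column totals add up to 1307, which agrees with the number of rows left after dropping the NA entries.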

4 Table Plots

We can plot a heatmap-like mosaic chart for this table.

4.1 Using ggplot

# https://stackoverflow.com/questions/19233365/how-to-create-a-marimekko-mosaic-plot-in-ggplot2

ggplot(data = gss_summary, aes(x = Education, y = edu_prop)) +
  # bar width is mapped to the size of each Education group
  geom_bar(aes(width = edu_count, fill = DeathPenalty),
    stat = "identity",
    position = "fill",
    colour = "black"
  ) +
  # if labels are desired; group by DeathPenalty so labels stack like the bars
  geom_text(aes(label = scales::percent(edu_prop), group = DeathPenalty),
    position = position_stack(vjust = 0.5)
  ) +
  facet_grid(~Education, scales = "free_x", space = "free_x") +
  # the fill scale is added as its own layer, not inside theme()
  scale_fill_brewer(palette = "RdYlGn") +
  # theme(panel.spacing.x = unit(0, "npc")) + # if no spacing preferred between bars
  theme_void()

4.2 Using ggmosaic

# library(ggmosaic)

ggplot(data = gss2002) +
  geom_mosaic(aes(x = product(DeathPenalty, Education), fill = DeathPenalty))
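When this page was rendered, the call above stopped with the error below. It appears to come from a version mismatch between ggmosaic and the installed ggplot2, not from the data:

Error in make_title(..., self = self): unused arguments (list(), "x")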
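If ggmosaic does not work on your installation, one possible fallback is base R's `mosaicplot()`. This is a different function from a different package (graphics), so the look will not match ggmosaic. The sketch below assumes the cell counts reported in the cross-tabulation in Section 4.3 of this page; the colours and labels are arbitrary choices.

```r
# Cell counts as reported in the DeathPenalty ~ Education table on this page
counts <- matrix(
  c(117, 511, 71, 135, 64,
     72, 200, 16,  71, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    DeathPenalty = c("Favor", "Oppose"),
    Education = c("Left HS", "HS", "Jr Col", "Bachelors", "Graduate")
  )
)

# mosaicplot() splits on the first dimension first,
# so transpose to put Education along the x-axis
edu_by_opinion <- t(counts)

# Tile widths follow Education group sizes; tile heights follow
# the Favor/Oppose split within each Education level
mosaicplot(
  edu_by_opinion,
  color = c("palegreen3", "tomato"),
  main = "Death Penalty opinion by Education",
  xlab = "Education",
  ylab = "DeathPenalty"
)
```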

4.3 Observed Statistic: the X^2 metric

When more than two proportions are compared at once, we use the \(\chi^2\) (chi-square) test.

Let us now perform the base chi-square test. We first need a table of counts, and then we run the test on that table:

gss_table <- tally(DeathPenalty ~ Education, data = gss2002)
gss_table
            Education
DeathPenalty Left HS  HS Jr Col Bachelors Graduate
      Favor      117 511     71       135       64
      Oppose      72 200     16        71       50
# Get the observed chi-square statistic
observedChi2 <- mosaic::chisq(tally(DeathPenalty ~ Education, data = gss2002))
observedChi2
X.squared 
 23.45093 
# Actual chi-square test
stats::chisq.test(tally(DeathPenalty ~ Education, data = gss2002))

    Pearson's Chi-squared test

data:  tally(DeathPenalty ~ Education, data = gss2002)
X-squared = 23.451, df = 4, p-value = 0.0001029
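To make the statistic less of a black box, here is a hedged base R sketch that computes it by hand from the table printed above, using the usual Pearson formula \(X^2 = \sum (O - E)^2 / E\), where \(O\) is an observed cell count and \(E\) is the count expected if the rows and columns were independent. The object names are chosen here for illustration.

```r
# Observed counts, copied from the table printed above
observed <- matrix(
  c(117, 511, 71, 135, 64,
     72, 200, 16,  71, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    DeathPenalty = c("Favor", "Oppose"),
    Education = c("Left HS", "HS", "Jr Col", "Bachelors", "Graduate")
  )
)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic and its degrees of freedom
x2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value from the theoretical chi-square distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

c(X.squared = x2, df = df, p.value = p_value)
```

The hand computation should agree with the `chisq.test()` output above: a statistic of about 23.45 on 4 degrees of freedom.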

What would our Hypotheses be?

$$
\begin{aligned}
H_0 &: \text{Education does not affect votes on the Death Penalty} \\
H_a &: \text{Education affects votes on the Death Penalty}
\end{aligned}
$$

We now repeat the test many times, shuffling Education each time, to build the null distribution of the chi-square statistic:

null_chisq <- do(10000) * chisq.test(tally(DeathPenalty ~ shuffle(Education), data = gss2002))

head(null_chisq)
gf_histogram(~X.squared, data = null_chisq) %>%
  gf_vline(xintercept = observedChi2, color = "red")

prop1(~ X.squared >= observedChi2, data = null_chisq)
 prop_TRUE 
0.00029997 
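Note that `prop1()` adds 1 to both the numerator and the denominator, so the value above is 3/10001: only 2 of the 10000 shuffles produced a statistic at least as large as the observed one. The same procedure can be sketched in base R without mosaic. This is an illustrative re-implementation, not the code used on this page: it rebuilds one row per respondent from the printed table of counts, uses fewer shuffles (2000) to keep it quick, and the helper `chi2_stat()` is a name made up here.

```r
set.seed(2022)

# Observed counts, copied from the table printed in Section 4.3
counts <- matrix(
  c(117, 511, 71, 135, 64,
     72, 200, 16,  71, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    DeathPenalty = c("Favor", "Oppose"),
    Education = c("Left HS", "HS", "Jr Col", "Bachelors", "Graduate")
  )
)

# Rebuild one row per respondent from the cell counts
cells <- as.data.frame(as.table(counts))
people <- cells[rep(seq_len(nrow(cells)), cells$Freq),
                c("DeathPenalty", "Education")]

# Hypothetical helper: Pearson chi-square statistic for two factors
chi2_stat <- function(x, y) {
  observed <- table(x, y)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum((observed - expected)^2 / expected)
}

observed_x2 <- chi2_stat(people$DeathPenalty, people$Education)

# Null distribution: shuffle Education, recompute the statistic
null_x2 <- replicate(
  2000,
  chi2_stat(people$DeathPenalty, sample(people$Education))
)

# Permutation p-value with the same +1 correction that prop1() applies
p_value <- (sum(null_x2 >= observed_x2) + 1) / (length(null_x2) + 1)
p_value
```

Because the shuffles are random, the exact p-value changes with the seed and the number of shuffles, but it should stay very small.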

The p-value (about 0.0003) is well below our threshold of \(0.05\), so we reject \(H_0\) and conclude that opinion on the Death Penalty is significantly associated with Education level!

5 Conclusion

So, what do you think?

License: CC BY-SA 2.0

Website made with ❤️ and Quarto, by Arvind V.

Hosted by Netlify.