The Mad Hatter’s Guide to Data Viz and Stats in R
Permutation Tests

Author: Arvind V.
Published: November 21, 2022
Modified: October 1, 2025

Abstract: A bunch of case studies with Permutation Tests.

Plot Fonts and Theme

library(systemfonts)
library(showtext)
library(ggrepel)
library(marquee)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 8
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

1 Introduction

We saw from the diagram created by Allen Downey that there is only one test! We will now use this philosophy to develop a technique that lets us set up several statistical models in the same way, with nearly identical code.

We will use two packages in R, mosaic and the relatively new infer package, to develop our intuition for what are called permutation-based statistical tests.

2 Hypothesis Testing using Permutation

From Reference #1:

Hypothesis testing can be thought of as a 4-step process:

  1. State the null and alternative hypotheses.

  2. Compute a test statistic.

  3. Determine the p-value.

  4. Draw a conclusion.

    In a traditional introductory statistics course, once this general framework has been mastered, the main work is in applying the correct formula to compute the standard test statistics in step 2 and using a table or computer to determine the p-value based on the known (usually approximate) theoretical distribution of the test statistic under the null hypothesis.

    In a simulation-based approach, steps 2 and 3 change. In Step 2, it is no longer required that the test statistic be normalized to conform with a known, named distribution. Instead, natural test statistics, like the difference between two sample means \(\bar{y}_1 - \bar{y}_2\), can be used.

    In Step 3, we use randomization to approximate the sampling distribution of the test statistic. Our lady tasting tea example demonstrates how this can be done from first principles. More typically, we will use randomization to create new simulated data sets ( “Parallel Worlds”) that are like our original data in some ways, but make the null hypothesis true. For each simulated data set, we calculate our test statistic, just as we did for the original sample. Together, this collection of test statistics computed from the simulated samples constitute our randomization distribution.

    When creating a randomization distribution, we will attempt to satisfy 3 guiding principles.

  1. Be consistent with the null hypothesis. We need to simulate a world in which the null hypothesis is true. If we don’t do this, we won’t be testing our null hypothesis.

  2. Use the data in the original sample. The original data should shed light on some aspects of the distribution that are not determined by the null hypothesis. For example, a null hypothesis about a mean doesn’t tell us about the shape of the population distribution, but the data give us some indication.

  3. Reflect the way the original data were collected.

From Chihara and Hesterberg:

This is the core idea of statistical significance or classical hypothesis testing – to calculate how often pure random chance would give an effect as large as that observed in the data, in the absence of any real effect. If that probability is small enough, we conclude that the data provide convincing evidence of a real effect.

3 Permutation Tests using mosaic::shuffle()

The mosaic package provides the shuffle() function as a synonym for sample(). When used without additional arguments, this will permute its first argument.

library(mosaic)
shuffle(1:10)
 [1]  3  7  2  6  8  1  4 10  9  5

Applying shuffle() to an explanatory variable in a model allows us to test the null hypothesis that the explanatory variable has, in fact, no explanatory power. This idea can be used to test

  • the equivalence of two or more means,
  • the equivalence of two or more proportions,
  • whether a regression parameter (such as a correlation between two variables) is zero.

Coupled with mosaic::do(), we can repeat a shuffle many times, computing a desired statistic each time we shuffle. The distribution of this computed statistic is the null distribution, which can then be compared with the observed statistic to carry out the hypothesis test and obtain a p-value.
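Before we turn to the mosaic-based case studies, the shuffle-and-recompute recipe can be seen in miniature in base R alone. This is a hedged sketch with made-up data (not any dataset used below): `sample()` plays the role of `shuffle()`, and `replicate()` the role of `do()`.

```r
# Permutation test for a difference in group means, in base R only.
set.seed(42)
y <- c(10, 12, 9, 14, 11, 20, 22, 19, 24, 21) # made-up response values
g <- rep(c("A", "B"), each = 5)               # made-up group labels

# Observed statistic: difference in group means
obs_diff <- mean(y[g == "B"]) - mean(y[g == "A"])

# Null distribution: shuffle the group labels (breaking any real group
# effect, consistent with the null hypothesis) and recompute the statistic
null_dist <- replicate(5000, {
  g_shuffled <- sample(g)
  mean(y[g_shuffled == "B"]) - mean(y[g_shuffled == "A"])
})

# One-sided p-value: how often does pure chance match or beat the
# observed difference? (+1 correction avoids reporting exactly zero)
p_value <- (sum(null_dist >= obs_diff) + 1) / (length(null_dist) + 1)
```

The same three lines of logic (observed statistic, shuffled null distribution, tail proportion) recur in every case study below, with `shuffle()`, `do()`, and `prop1()` doing the work.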

3.1 Case Study-1: Hot Wings Orders vs Gender

A student conducted a study of hot wings and beer consumption at a Bar. She asked patrons at the bar to record their consumption of hot wings and beer over the course of several hours. She wanted to know if people who ate more hot wings would then drink more beer. In addition, she investigated whether or not gender had an impact on hot wings or beer consumption.

Beerwings <- resampledata3::Beerwings # From resampledata3 package
skimr::skim(Beerwings)
Data summary
Name Beerwings
Number of rows 30
Number of columns 4
_______________________
Column type frequency:
factor 1
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Gender 0 1 FALSE 2 F: 15, M: 15

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 15.50 8.80 1 8.25 15.5 22.75 30 ▇▇▇▇▇
Hotwings 0 1 11.93 4.78 4 8.00 12.5 15.50 21 ▅▃▇▃▃
Beer 0 1 26.20 11.84 0 24.00 30.0 36.00 48 ▁▃▅▇▂

Let us calculate the observed difference in Hotwings consumption between Males and Females (Gender):

mean(Hotwings ~ Gender, data = Beerwings)
        F         M 
 9.333333 14.533333 
obs_diff_wings <- mosaic::diffmean(data = Beerwings, Hotwings ~ Gender)
obs_diff_wings
diffmean 
     5.2 
gf_boxplot(
  data = Beerwings, Hotwings ~ Gender,
  orientation = "x", fill = ~Gender,
  title = "Hotwings Consumption by Gender"
)

The observed difference in mean consumption of Hotwings between Males and Females is \(5.2\). Could this have occurred by chance? Here is our formulation of the Hypotheses:

\[ \text{Null Hypothesis } H_0: \mu_F = \mu_M \qquad \text{Alternative Hypothesis } H_a: \mu_F \ne \mu_M \]

So we perform a Permutation Test to check:

null_dist_wings <- do(1000) * diffmean(Hotwings ~ shuffle(Gender),
  data = Beerwings
)
null_dist_wings %>% head()
gf_histogram(
  data = null_dist_wings, ~diffmean,
  fill = ~ (diffmean >= obs_diff_wings)
) %>%
  gf_vline(xintercept = obs_diff_wings, colour = "red") %>%
  gf_refine(scale_fill_manual(values = c("lightseagreen", "tomato")))

prop1(~ diffmean >= obs_diff_wings, data = null_dist_wings)
  prop_TRUE 
0.002997003 

The \(\color{red}{red\ line}\) shows the actual measured mean difference in Hot Wings consumption. The probability that the permutation distribution equals or exceeds that number is \(0.003\), so we reject the Null Hypothesis that the means are identical.

3.2 Case Study-2: Verizon

Verizon was an Incumbent Local Exchange Carrier (ILEC), responsible for maintaining land-line phone service in certain areas. Verizon also sold long-distance service, as did a number of competitors, termed Competitive Local Exchange Carriers (CLEC). When something went wrong, Verizon was responsible for repairs, and was supposed to make repairs as quickly for CLEC long-distance customers as for their own.

The New York Public Utilities Commission (PUC) monitored fairness by comparing repair times for Verizon and different CLECs, for different classes of repairs and time periods. In each case a hypothesis test was performed at the 1% significance level, to determine whether repairs for CLEC’s customers were significantly slower than for Verizon’s customers. There were hundreds of such tests. If substantially more than 1% of the tests were significant, then Verizon would pay large penalties.

These tests were performed using t tests; Verizon proposed using permutation tests instead.

verizon <- resampledata3::Verizon # From resampledata3 package
verizon
skim(verizon)
Data summary
Name verizon
Number of rows 1687
Number of columns 2
_______________________
Column type frequency:
factor 1
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Group 0 1 FALSE 2 ILE: 1664, CLE: 23

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Time 0 1 8.52 14.79 0 0.75 3.63 7.35 191.6 ▇▁▁▁▁

Let us make a boxplot of the repair times for Verizon and CLEC:

gf_boxplot(
  data = verizon, Group ~ Time,
  orientation = "y", fill = ~Group,
  title = "Repair Times by Operator"
)

mean(Time ~ Group, data = verizon)
     CLEC      ILEC 
16.509130  8.411611 
obs_diff_verizon <- diffmean(Time ~ Group, data = verizon)
obs_diff_verizon
diffmean 
-8.09752 
null_dist_verizon <- do(1000) * diffmean(Time ~ shuffle(Group),
  data = verizon
)
gf_histogram(
  data = null_dist_verizon, ~diffmean,
  fill = ~ (diffmean <= obs_diff_verizon)
) %>%
  gf_vline(xintercept = obs_diff_verizon, colour = "red") %>%
  gf_refine(scale_fill_manual(values = c("lightseagreen", "tomato")))

prop1(~ diffmean <= obs_diff_verizon, data = null_dist_verizon)

It seems that the mean repair Time for CLEC customers is significantly higher than for ILEC customers, and there is only a very small chance that a difference this extreme could have occurred by random chance. So we reject the Null Hypothesis that the mean repair Times are identical.

3.3 Case Study-3: Recidivism

Do criminals released after a jail term commit crimes again?

recidivism <- resampledata3::Recidivism # From resampledata3 package
recidivism
skim(recidivism)
Data summary
Name recidivism
Number of rows 17022
Number of columns 7
_______________________
Column type frequency:
factor 6
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Gender 3 1 FALSE 2 M: 14918, F: 2101
Age 3 1 FALSE 5 25-: 6227, 35-: 4035, Und: 3077, 45-: 2872
Age25 3 1 FALSE 2 Ove: 13942, Und: 3077
Offense 0 1 FALSE 2 Fel: 13720, Mis: 3302
Recid 0 1 FALSE 2 No: 11636, Yes: 5386
Type 0 1 FALSE 3 No : 11636, New: 3437, Tec: 1949

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Days 11636 0.32 473.33 283.14 0 241 418 687 1095 ▆▇▆▅▃

There are some missing values in the variable Age25. We create a new data frame omitting the rows where there is a missing value in the Age25 variable, using tidyr::drop_na():

recidivism_na <- recidivism %>% tidyr::drop_na(Age25)

Also, the variable Recid is a factor variable coded “Yes” or “No”. We convert it to a numeric variable of 1’s and 0’s.

recidivism_na <- recidivism_na %>%
  mutate(Recid2 = ifelse(Recid == "Yes", 1, 0))

obs_diff_recid <- diffmean(Recid2 ~ Age25, data = recidivism_na)
obs_diff_recid
  diffmean 
0.05919913 
null_dist_recid <- do(1000) * diffmean(Recid2 ~ shuffle(Age25), data = recidivism_na)

gf_histogram(~diffmean, data = null_dist_recid) %>%
  gf_vline(xintercept = obs_diff_recid, colour = "red")

The probability that the difference in means of Recid2 between the two age groups is as large as the observed value is about \(0.15\), and hence we cannot reject the Null Hypothesis that the means are identical.

3.4 Case Study-4: Matched Pairs: Results from a diving championship.

Diving2017 <- resampledata3::Diving2017 # From resampledata3 package
head(Diving2017)
skim(Diving2017)
Data summary
Name Diving2017
Number of rows 12
Number of columns 4
_______________________
Column type frequency:
factor 2
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Name 0 1 FALSE 12 SI: 1, BEN: 1, CHA: 1, CHE: 1
Country 0 1 FALSE 8 Can: 2, Chi: 2, Mal: 2, Nor: 2

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Semifinal 0 1 338.50 22.95 313.70 322.20 325.62 356.57 382.8 ▇▁▂▂▁
Final 0 1 350.48 40.02 283.35 318.59 358.92 387.15 397.5 ▃▃▂▆▇

The data consists of paired observations per diver. So we need to take the difference between the two scores for each diver and then randomly assign each difference a positive or negative sign (“polarity”). Another way to look at this is to shuffle the scores between Semifinal and Final on a per-diver basis.

obs_diff_swim <- mean(~ Final - Semifinal, data = Diving2017)
obs_diff_swim
[1] 11.975
polarity <- c(rep(1, 6), rep(-1, 6))
polarity
 [1]  1  1  1  1  1  1 -1 -1 -1 -1 -1 -1
null_dist_swim <- do(100000) * mean(
  data = Diving2017,
  ~ (Final - Semifinal) * resample(polarity,
    replace = TRUE
  )
)
null_dist_swim %>% head()
gf_histogram(data = null_dist_swim, ~mean) %>%
  gf_vline(xintercept = obs_diff_swim, colour = "red")

It appears that there is no reason to accept the Alternative Hypothesis that the diving scores differ between the semifinal and final events.
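The sign-flipping trick used above generalizes to any matched-pairs design. Here is a minimal base-R sketch with a made-up vector of paired differences (not the Diving2017 data), showing the same null model: under \(H_0\), each difference is equally likely to be positive or negative.

```r
# Matched-pairs permutation test via random sign flips, base R only.
set.seed(2022)
diffs <- c(2.1, -0.4, 1.3, 0.8, -1.6, 2.4, 0.2, 1.1) # made-up paired differences
obs_mean <- mean(diffs)

# Under H0 the pairing carries no information, so each difference's sign
# can be flipped at random; recompute the mean difference each time
null_dist <- replicate(10000, {
  signs <- sample(c(-1, 1), length(diffs), replace = TRUE)
  mean(diffs * signs)
})

# Two-sided p-value: chance of a mean difference at least this extreme
p_value <- mean(abs(null_dist) >= abs(obs_mean))
```

This is exactly what the `polarity` vector with `resample()` accomplishes in the Diving2017 chunk above.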

3.5 Case Study #5: Flight Delays

LaGuardia Airport (LGA) is one of three major airports that serves the New York City metropolitan area. In 2008, over 23 million passengers and over 375,000 planes flew in or out of LGA. United Airlines and American Airlines are two major airlines that schedule services at LGA. The data set FlightDelays contains information on all 4029 departures of these two airlines from LGA during May and June 2009.

flightDelays <- resampledata3::FlightDelays # From resampledata3 package
skim(flightDelays)
Data summary
Name flightDelays
Number of rows 4029
Number of columns 10
_______________________
Column type frequency:
factor 6
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Carrier 0 1 FALSE 2 AA: 2906, UA: 1123
Destination 0 1 FALSE 7 ORD: 1785, DFW: 918, MIA: 610, DEN: 264
DepartTime 0 1 FALSE 5 8-N: 1053, Noo: 1048, 4-8: 972, 4-8: 699
Day 0 1 FALSE 7 Fri: 637, Mon: 630, Tue: 628, Thu: 566
Month 0 1 FALSE 2 Jun: 2030, May: 1999
Delayed30 0 1 FALSE 2 No: 3432, Yes: 597

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 2015.00 1163.22 1 1008 2015 3022 4029 ▇▇▇▇▇
FlightNo 0 1 827.10 551.31 71 371 691 787 2255 ▅▇▁▂▂
FlightLength 0 1 185.30 41.79 68 155 163 228 295 ▁▆▇▅▂
Delay 0 1 11.74 41.63 -19 -6 -3 5 693 ▇▁▁▁▁

The variables in the flightDelays dataset are:

flightDelay dataset variables
Variable Description
Carrier UA=United Airlines, AA=American Airlines
FlightNo Flight number
Destination Airport code
DepartTime Scheduled departure time in 4 h intervals
Day Day of the Week
Month May or June
Delay Minutes flight delayed (negative indicates early departure)
Delayed30 Departure delayed more than 30 min? Yes or No
FlightLength Length of time of flight (minutes)
  1. Let us compute the proportion of times that each carrier’s flights were delayed more than 20 min. We will conduct a two-sided test to see if the difference in these proportions is statistically significant.
prop(data = flightDelays, (Delay >= 20) ~ Carrier)
prop_TRUE.AA prop_TRUE.UA 
   0.1713696    0.2226180 
obs_diff_delay <- diffprop(data = flightDelays, Delay >= 20 ~ Carrier)
obs_diff_delay
  diffprop 
0.05124841 

We see that carrier AA has a 17.13% chance of a delay of 20 min or more, while UA has a 22.26% chance. The difference is 5.12 percentage points. Is this statistically significant? We take the Delays for both Carriers and perform a permutation test by shuffling the Carrier variable:

null_dist_delay <- do(4999) * diffprop(data = flightDelays, Delay >= 20 ~ shuffle(Carrier))
null_dist_delay %>% head()
gf_histogram(data = null_dist_delay, ~diffprop) %>% gf_vline(xintercept = obs_diff_delay, color = "red")

It appears that the difference in delay times is significant. We can compute the p-value based on this test:

2 * mean(null_dist_delay$diffprop >= obs_diff_delay)
[1] 0

which is very small. Hence we reject the null Hypothesis that there is no difference between carriers on delay times.

  2. Compute the variance in the flight delay lengths for each carrier. Conduct a test to see if the variance for United Airlines differs from that of American Airlines.
var(data = flightDelays, Delay ~ Carrier)
      AA       UA 
1606.457 2037.525 
# There is no readymade function in mosaic called `diffvar`...so...we construct one
obs_diff_var <- diff(var(data = flightDelays, Delay ~ Carrier))
obs_diff_var
      UA 
431.0677 

The difference in variances in Delay between the two carriers (UA minus AA) is \(431.0677\). In our Permutation Test, we shuffle the Carrier variable:

null_dist_var <-
  do(4999) * diff(var(data = flightDelays, Delay ~ shuffle(Carrier)))
null_dist_var %>% head()
# The null distribution variable is called `UA`
gf_histogram(data = null_dist_var, ~UA) %>%
  gf_vline(xintercept = obs_diff_var, color = "red")

2 * mean(null_dist_var$UA >= obs_diff_var)
[1] 0.2940588

Clearly there is no case for a significant difference in variances!

3.6 Case Study #6: Walmart vs Target

Is there a difference in the price of groceries sold by the two retailers Target and Walmart? The data set Groceries contains a sample of grocery items and their prices advertised on their respective web sites on one specific day.

  1. Inspect the data set, then explain why this is an example of matched pairs data.
  2. Compute summary statistics of the prices for each store.
  3. Conduct a permutation test to determine whether or not there is a difference in the mean prices.
  4. Create a bar chart of the difference in prices. What is unusual about Quaker Oats Life cereal?
  5. Redo the hypothesis test without this observation. Do you reach the same conclusion?
groceries <- resampledata3::Groceries %>% # From resampledata3 package
  mutate(Product = stringr::str_squish(Product))
groceries
skim(groceries)
Data summary
Name groceries
Number of rows 30
Number of columns 4
_______________________
Column type frequency:
character 1
factor 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Product 0 1 8 56 0 30 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Size 0 1 FALSE 24 18o: 3, 12o: 2, 15o: 2, 16o: 2

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Target 0 1 2.76 1.58 0.99 1.83 2.54 3.14 7.99 ▇▇▁▁▁
Walmart 0 1 2.71 1.56 1.00 1.76 2.34 2.96 6.98 ▇▅▁▁▂

We see that the comparison is to be made between two prices for the same product, and hence this is one more example of paired data, as in Case Study #4. Let us plot the prices for the products:

gf_col(
  data = groceries,
  Target ~ Product,
  fill = "#0073C299",
  width = 0.5
) %>%
  gf_col(
    data = groceries,
    -Walmart ~ Product,
    fill = "#EFC00099",
    ylab = "Prices",
    width = 0.5
  ) %>%
  gf_col(
    data = groceries %>% filter(Product == "Quaker Oats Life Cereal Original"),
    -Walmart ~ Product,
    fill = "red",
    width = 0.5
  ) %>%
  gf_theme(theme_classic()) %>%
  gf_theme(ggplot2::theme(axis.text.x = element_text(
    size = 8,
    face = "bold",
    vjust = 0,
    hjust = 1
  ))) %>%
  gf_theme(ggplot2::coord_flip())
Figure 1: Walmart vs Target Prices

We see that the price difference between Walmart and Target prices is highest for the Product named Quaker Oats Life Cereal Original. Let us check the mean difference in prices:

obs_diff_price <- mean(~ Walmart - Target, data = groceries)
obs_diff_price
[1] -0.05666667

Let us perform the pair-wise permutation test on prices, by randomly flipping the sign of each price difference, just as in the matched-pairs study earlier:

polarity <- c(rep(1, 15), rep(-1, 15))
polarity
 [1]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
[26] -1 -1 -1 -1 -1
null_dist_price <- do(100000) * mean(
  data = groceries,
  ~ (Walmart - Target) * resample(polarity,
    replace = TRUE
  )
)
null_dist_price %>% head()
gf_histogram(data = null_dist_price, ~mean) %>%
  gf_vline(xintercept = obs_diff_price, colour = "red")

2 * mean(null_dist_price$mean <= obs_diff_price) # two-sided P-value

Does not seem to be any significant difference in prices…

Suppose we knock off the Quaker Cereal data item…

which(groceries$Product == "Quaker Oats Life Cereal Original")
[1] 2
groceries_less <- groceries[-2, ]
groceries_less
obs_diff_price_less <- mean(~ Walmart - Target, data = groceries_less)
obs_diff_price_less
[1] -0.1558621
polarity_less <- c(rep(1, 15), rep(-1, 14)) # Due to resampling this small bias makes no difference
null_dist_price_less <- do(100000) * mean(
  data = groceries_less,
  ~ (Walmart - Target) * resample(polarity_less,
    replace = TRUE
  )
)
null_dist_price_less %>% head()
gf_histogram(data = null_dist_price_less, ~mean) %>%
  gf_vline(xintercept = obs_diff_price_less, colour = "red")

1 - mean(null_dist_price_less$mean >= obs_diff_price_less) # P-value
[1] 0.01537

With the Quaker Oats outlier removed, the P-value is about \(0.015\), so the difference in mean prices is now significant: we do not reach the same conclusion as before.

3.7 Case Study 7: Proportions between Categorical Variables

Let us try a dataset with Qualitative / Categorical data. This is a General Social Survey dataset, and we have people with different levels of Education stating their opinion on the Death Penalty. We want to know if these two Categorical variables have a correlation, i.e. can the opinions in favour of the Death Penalty be explained by the Education level?

Since the data is Categorical, we need to tabulate counts and then use a chi-square test. In the test, we will permute the Education variable to see whether its effect on DeathPenalty opinion is significant.
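As a warm-up, the permute-one-column idea for a chi-squared statistic can be sketched in base R with made-up categorical data (not the GSS2002 data, which we analyse below). Here the two variables are independent by construction, so the observed statistic should be unremarkable under the null distribution.

```r
# Permutation chi-squared test for independence, base R only.
set.seed(101)
education <- sample(c("HS", "Bachelors", "Graduate"), 300, replace = TRUE)
opinion   <- sample(c("Favor", "Oppose"), 300, replace = TRUE)

# Observed chi-squared statistic from the contingency table
obs_chi2 <- suppressWarnings(chisq.test(table(education, opinion))$statistic)

# Shuffle one variable to break any association, recompute the statistic
null_dist <- replicate(2000, {
  suppressWarnings(chisq.test(table(sample(education), opinion))$statistic)
})

# Permutation p-value: proportion of shuffles at least as extreme
p_value <- mean(null_dist >= obs_chi2)
```

With the real GSS2002 data below, the same recipe is run with `tally()`, `shuffle()`, and `do()` from mosaic.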

GSS2002 <- resampledata::GSS2002 # From resampledata package

skim(GSS2002)
Data summary
Name GSS2002
Number of rows 2765
Number of columns 21
_______________________
Column type frequency:
factor 20
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Region 0 1.00 FALSE 7 Nor: 684, Sou: 486, Sou: 471, Mid: 435
Gender 0 1.00 FALSE 2 Fem: 1537, Mal: 1228
Race 0 1.00 FALSE 3 Whi: 2188, Bla: 410, Oth: 167
Education 5 1.00 FALSE 5 HS: 1485, Bac: 443, Lef: 400, Gra: 230
Marital 0 1.00 FALSE 5 Mar: 1269, Nev: 708, Div: 445, Wid: 247
Religion 19 0.99 FALSE 13 Pro: 1460, Cat: 673, Non: 379, Chr: 65
Happy 1396 0.50 FALSE 3 Pre: 784, Ver: 415, Not: 170
Income 890 0.68 FALSE 24 400: 170, 300: 166, 250: 140, 500: 136
PolParty 36 0.99 FALSE 8 Ind: 528, Not: 515, Not: 449, Str: 408
Politics 1434 0.48 FALSE 7 Mod: 522, Con: 210, Sli: 209, Sli: 159
Marijuana 1914 0.31 FALSE 2 Not: 545, Leg: 306
DeathPenalty 1457 0.47 FALSE 2 Fav: 899, Opp: 409
OwnGun 1841 0.33 FALSE 3 No: 605, Yes: 310, Ref: 9
GunLaw 1849 0.33 FALSE 2 Fav: 737, Opp: 179
SpendMilitary 1441 0.48 FALSE 3 Abo: 615, Too: 414, Too: 295
SpendEduc 1422 0.49 FALSE 3 Too: 992, Abo: 278, Too: 73
SpendEnv 1443 0.48 FALSE 3 Too: 793, Abo: 439, Too: 90
SpendSci 1499 0.46 FALSE 3 Abo: 629, Too: 461, Too: 176
Pres00 1016 0.63 FALSE 5 Bus: 885, Gor: 781, Nad: 57, Oth: 16
Postlife 1554 0.44 FALSE 2 Yes: 975, No: 236

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 1383 798.33 1 692 1383 2074 2765 ▇▇▇▇▇

Note how all variables (except ID) are Categorical! Education has five levels:

GSS2002 %>% count(Education)
GSS2002 %>% count(DeathPenalty)

Let us drop NA entries in Education and DeathPenalty, and set up a table for the chi-square test.

gss2002 <- GSS2002 %>%
  dplyr::select(Education, DeathPenalty) %>%
  tidyr::drop_na(., c(Education, DeathPenalty))
dim(gss2002)
[1] 1307    2
gss_summary <- gss2002 %>%
  mutate(
    Education = factor(
      Education,
      levels = c("Bachelors", "Graduate", "Jr Col", "HS", "Left HS"),
      labels = c("Bachelors", "Graduate", "Jr Col", "HS", "Left HS")
    ),
    DeathPenalty = as.factor(DeathPenalty)
  ) %>%
  group_by(Education, DeathPenalty) %>%
  summarise(count = n()) %>% # This is good for a chisq test

  # Add two more columns to facilitate a mosaic/Marimekko plot
  #
  mutate(
    edu_count = sum(count),
    edu_prop = count / sum(count)
  ) %>%
  ungroup()

gss_summary
# We can plot a heatmap-like `mosaic chart` for this table, using `ggplot`:
# https://stackoverflow.com/questions/19233365/how-to-create-a-marimekko-mosaic-plot-in-ggplot2

ggplot(data = gss_summary, aes(x = Education, y = edu_prop)) +
  geom_bar(
    aes(width = edu_count, fill = DeathPenalty),
    stat = "identity", position = "fill", colour = "black"
  ) +
  geom_text(aes(label = scales::percent(edu_prop)),
    position = position_stack(vjust = 0.5)
  ) + # if labels are desired
  facet_grid(~Education, scales = "free_x", space = "free_x") +
  scale_fill_brewer(palette = "RdYlGn") +
  # theme(panel.spacing.x = unit(0, "npc")) + # if no spacing preferred between bars
  theme_void()

Let us now perform the standard chi-square test: we need a table of counts, and then stats::chisq.test():

gss_table <- tally(DeathPenalty ~ Education, data = gss2002)
gss_table
            Education
DeathPenalty Left HS  HS Jr Col Bachelors Graduate
      Favor      117 511     71       135       64
      Oppose      72 200     16        71       50
# Get the observed chi-square statistic
observedChi2 <- mosaic::chisq(tally(DeathPenalty ~ Education, data = gss2002))
observedChi2
X.squared 
 23.45093 
# Actual chi-square test
stats::chisq.test(tally(DeathPenalty ~ Education, data = gss2002))

    Pearson's Chi-squared test

data:  tally(DeathPenalty ~ Education, data = gss2002)
X-squared = 23.451, df = 4, p-value = 0.0001029

We should now repeat the test with permutations on Education:

null_chisq <- do(4999) * chisq.test(tally(DeathPenalty ~ shuffle(Education), data = gss2002))

head(null_chisq)
gf_histogram(~X.squared, data = null_chisq) %>%
  gf_vline(xintercept = observedChi2, color = "red")

The observed chi-square statistic of \(23.45\) lies far in the right tail of the permutation null distribution, so we conclude that Education has a significant effect on DeathPenalty opinion!

4 References

  1. Chapter 11, Hypothesis Testing with Randomization, in Introduction to Modern Statistics (1st Ed) by Mine Çetinkaya-Rundel and Johanna Hardin.

  2. Mathematical Statistics with Resampling and R (3rd Ed) by Laura M. Chihara and Tim C. Hesterberg. Wiley, 2022.

R Package Citations

Package Version
ggformula 0.12.2
mosaic 1.9.2
resampledata3 1.0
skimr 2.2.1

License: CC BY-SA 2.0
