The Mad Hatter’s Guide to Data Viz and Stats in R
Permutation Tests

Author: Arvind V.
Published: November 21, 2022
Modified: October 1, 2025

Abstract: A bunch of case studies with Permutation Tests.

Plot Fonts and Theme

library(systemfonts)
library(showtext)
library(ggrepel)
library(marquee)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 8
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

1 Introduction

We saw from the diagram created by Allen Downey that there is only one test! We will now use this philosophy to develop a technique that lets us set up several statistical models in the same way, with nearly identical code.

We will use two packages in R, mosaic and the relatively new infer package, to develop our intuition for what are called permutation-based statistical tests.

2 Hypothesis Testing using Permutation

From Reference #1:

Hypothesis testing can be thought of as a 4-step process:

  1. State the null and alternative hypotheses.

  2. Compute a test statistic.

  3. Determine the p-value.

  4. Draw a conclusion.

    In a traditional introductory statistics course, once this general framework has been mastered, the main work is in applying the correct formula to compute the standard test statistics in step 2 and using a table or computer to determine the p-value based on the known (usually approximate) theoretical distribution of the test statistic under the null hypothesis.

    In a simulation-based approach, steps 2 and 3 change. In Step 2, it is no longer required that the test statistic be normalized to conform with a known, named distribution. Instead, natural test statistics, like the difference between two sample means \(\bar{y}_1 - \bar{y}_2\), can be used.

    In Step 3, we use randomization to approximate the sampling distribution of the test statistic. Our lady tasting tea example demonstrates how this can be done from first principles. More typically, we will use randomization to create new simulated data sets ( “Parallel Worlds”) that are like our original data in some ways, but make the null hypothesis true. For each simulated data set, we calculate our test statistic, just as we did for the original sample. Together, this collection of test statistics computed from the simulated samples constitute our randomization distribution.

    When creating a randomization distribution, we will attempt to satisfy 3 guiding principles.

  1. Be consistent with the null hypothesis. We need to simulate a world in which the null hypothesis is true. If we don’t do this, we won’t be testing our null hypothesis.

  2. Use the data in the original sample. The original data should shed light on some aspects of the distribution that are not determined by the null hypothesis. For example, a null hypothesis about a mean doesn’t tell us about the shape of the population distribution, but the data give us some indication.

  3. Reflect the way the original data were collected.

From Chihara and Hesterberg:

This is the core idea of statistical significance or classical hypothesis testing – to calculate how often pure random chance would give an effect as large as that observed in the data, in the absence of any real effect. If that probability is small enough, we conclude that the data provide convincing evidence of a real effect.

3 Permutation Tests using mosaic::shuffle()

The mosaic package provides the shuffle() function as a synonym for sample(). When used without additional arguments, this will permute its first argument.

library(mosaic)
shuffle(1:10)
 [1]  3  7  2  6  8  1  4 10  9  5

Applying shuffle() to an explanatory variable in a model allows us to test the null hypothesis that the explanatory variable has, in fact, no explanatory power. This idea can be used to test

  • the equivalence of two or more means,
  • the equivalence of two or more proportions,
  • whether a regression parameter (such as a correlation between two variables) is zero.

Coupled with mosaic::do(), we can repeat a shuffle many times, computing a desired statistic each time we shuffle. The distribution of this computed statistic is the null distribution, which can then be compared with the observed statistic to carry out the hypothesis test and obtain a p-value.
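Before we turn to the mosaic-based case studies, the shuffle-and-recompute recipe can be seen in miniature in base R alone. This is a hedged sketch with made-up data (not any dataset used below): `sample()` plays the role of `shuffle()`, and `replicate()` the role of `do()`.

```r
# Permutation test for a difference in group means, in base R only.
set.seed(42)
y <- c(10, 12, 9, 14, 11, 20, 22, 19, 24, 21) # made-up response values
g <- rep(c("A", "B"), each = 5)               # made-up group labels

# Observed statistic: difference in group means
obs_diff <- mean(y[g == "B"]) - mean(y[g == "A"])

# Null distribution: shuffle the group labels (breaking any real group
# effect, consistent with the null hypothesis) and recompute the statistic
null_dist <- replicate(5000, {
  g_shuffled <- sample(g)
  mean(y[g_shuffled == "B"]) - mean(y[g_shuffled == "A"])
})

# One-sided p-value: how often does pure chance match or beat the
# observed difference? (+1 correction avoids reporting exactly zero)
p_value <- (sum(null_dist >= obs_diff) + 1) / (length(null_dist) + 1)
```

The same three lines of logic (observed statistic, shuffled null distribution, tail proportion) recur in every case study below, with `shuffle()`, `do()`, and `prop1()` doing the work.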

3.1 Case Study-1: Hot Wings Orders vs Gender

A student conducted a study of hot wings and beer consumption at a Bar. She asked patrons at the bar to record their consumption of hot wings and beer over the course of several hours. She wanted to know if people who ate more hot wings would then drink more beer. In addition, she investigated whether or not gender had an impact on hot wings or beer consumption.

Beerwings <- resampledata3::Beerwings # From resampledata3 package
skimr::skim(Beerwings)
Data summary
Name Beerwings
Number of rows 30
Number of columns 4
_______________________
Column type frequency:
factor 1
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Gender 0 1 FALSE 2 F: 15, M: 15

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 15.50 8.80 1 8.25 15.5 22.75 30 ▇▇▇▇▇
Hotwings 0 1 11.93 4.78 4 8.00 12.5 15.50 21 ▅▃▇▃▃
Beer 0 1 26.20 11.84 0 24.00 30.0 36.00 48 ▁▃▅▇▂

Let us calculate the observed difference in Hotwings consumption between Males and Females (Gender):

mean(Hotwings ~ Gender, data = Beerwings)
        F         M 
 9.333333 14.533333 
obs_diff_wings <- mosaic::diffmean(data = Beerwings, Hotwings ~ Gender)
obs_diff_wings
diffmean 
     5.2 
gf_boxplot(
  data = Beerwings, Hotwings ~ Gender,
  orientation = "x", fill = ~Gender,
  title = "Hotwings Consumption by Gender"
)

The observed difference in mean consumption of Hotwings between Males and Females is \(5.2\). Could this have occurred by chance? Here is our formulation of the Hypotheses:

\[ \text{Null Hypothesis } H_0: \mu_F = \mu_M \qquad \text{Alternative Hypothesis } H_a: \mu_F \ne \mu_M \]

So we perform a Permutation Test to check:

null_dist_wings <- do(1000) * diffmean(Hotwings ~ shuffle(Gender),
  data = Beerwings
)
null_dist_wings %>% head()
gf_histogram(
  data = null_dist_wings, ~diffmean,
  fill = ~ (diffmean >= obs_diff_wings)
) %>%
  gf_vline(xintercept = obs_diff_wings, colour = "red") %>%
  gf_refine(scale_fill_manual(values = c("lightseagreen", "tomato")))

prop1(~ diffmean >= obs_diff_wings, data = null_dist_wings)
  prop_TRUE 
0.002997003 

The \(\color{red}{red\ line}\) shows the actual measured mean difference in Hot Wings consumption. The probability that the permutation distribution equals or exceeds that number is \(0.003\), so we reject the Null Hypothesis that the means are identical.

3.2 Case Study-2: Verizon

Verizon was an Incumbent Local Exchange Carrier (ILEC), responsible for maintaining land-line phone service in certain areas. Verizon also sold long-distance service, as did a number of competitors, termed Competitive Local Exchange Carriers (CLEC). When something went wrong, Verizon was responsible for repairs, and was supposed to make repairs as quickly for CLEC long-distance customers as for their own.

The New York Public Utilities Commission (PUC) monitored fairness by comparing repair times for Verizon and different CLECs, for different classes of repairs and time periods. In each case a hypothesis test was performed at the 1% significance level, to determine whether repairs for CLEC’s customers were significantly slower than for Verizon’s customers. There were hundreds of such tests. If substantially more than 1% of the tests were significant, then Verizon would pay large penalties.

These tests were performed using t tests; Verizon proposed using permutation tests instead.

verizon <- resampledata3::Verizon # From resampledata3 package
verizon
skim(verizon)
Data summary
Name verizon
Number of rows 1687
Number of columns 2
_______________________
Column type frequency:
factor 1
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Group 0 1 FALSE 2 ILE: 1664, CLE: 23

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Time 0 1 8.52 14.79 0 0.75 3.63 7.35 191.6 ▇▁▁▁▁

Let us make a boxplot of the repair times for Verizon and CLEC:

gf_boxplot(
  data = verizon, Group ~ Time,
  orientation = "y", fill = ~Group,
  title = "Repair Times by Operator"
)

mean(Time ~ Group, data = verizon)
     CLEC      ILEC 
16.509130  8.411611 
obs_diff_verizon <- diffmean(Time ~ Group, data = verizon)
obs_diff_verizon
diffmean 
-8.09752 
null_dist_verizon <- do(1000) * diffmean(Time ~ shuffle(Group),
  data = verizon
)
gf_histogram(
  data = null_dist_verizon, ~diffmean,
  fill = ~ (diffmean <= obs_diff_verizon)
) %>%
  gf_vline(xintercept = obs_diff_verizon, colour = "red") %>%
  gf_refine(scale_fill_manual(values = c("lightseagreen", "tomato")))

prop1(~ diffmean <= obs_diff_verizon, data = null_dist_verizon)

It seems that the mean repair Time for CLEC customers is significantly higher than for ILEC customers, and there is only a very small chance that a difference this extreme could have occurred by random chance. So we reject the Null Hypothesis that the mean repair Times are identical.

3.3 Case Study-3: Recidivism

Do criminals released after a jail term commit crimes again?

recidivism <- resampledata3::Recidivism # From resampledata3 package
recidivism
skim(recidivism)
Data summary
Name recidivism
Number of rows 17022
Number of columns 7
_______________________
Column type frequency:
factor 6
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Gender 3 1 FALSE 2 M: 14918, F: 2101
Age 3 1 FALSE 5 25-: 6227, 35-: 4035, Und: 3077, 45-: 2872
Age25 3 1 FALSE 2 Ove: 13942, Und: 3077
Offense 0 1 FALSE 2 Fel: 13720, Mis: 3302
Recid 0 1 FALSE 2 No: 11636, Yes: 5386
Type 0 1 FALSE 3 No : 11636, New: 3437, Tec: 1949

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Days 11636 0.32 473.33 283.14 0 241 418 687 1095 ▆▇▆▅▃

There are some missing values in the variable Age25. We create a new data frame omitting the rows where there is a missing value in the Age25 variable, using tidyr::drop_na():

recidivism_na <- recidivism %>% tidyr::drop_na(Age25)

Also, the variable Recid is a factor variable coded “Yes” or “No”. We convert it to a numeric variable of 1’s and 0’s.

recidivism_na <- recidivism_na %>%
  mutate(Recid2 = ifelse(Recid == "Yes", 1, 0))

obs_diff_recid <- diffmean(Recid2 ~ Age25, data = recidivism_na)
obs_diff_recid
  diffmean 
0.05919913 
null_dist_recid <- do(1000) * diffmean(Recid2 ~ shuffle(Age25), data = recidivism_na)

gf_histogram(~diffmean, data = null_dist_recid) %>%
  gf_vline(xintercept = obs_diff_recid, colour = "red")

The probability that the difference in means of Recid2 between the two age groups is as large as the observed value is about \(0.15\), and hence we cannot reject the Null Hypothesis that the means are identical.

3.4 Case Study-4: Matched Pairs: Results from a diving championship.

Diving2017 <- resampledata3::Diving2017 # From resampledata3 package
head(Diving2017)
skim(Diving2017)
Data summary
Name Diving2017
Number of rows 12
Number of columns 4
_______________________
Column type frequency:
factor 2
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Name 0 1 FALSE 12 SI: 1, BEN: 1, CHA: 1, CHE: 1
Country 0 1 FALSE 8 Can: 2, Chi: 2, Mal: 2, Nor: 2

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Semifinal 0 1 338.50 22.95 313.70 322.20 325.62 356.57 382.8 ▇▁▂▂▁
Final 0 1 350.48 40.02 283.35 318.59 358.92 387.15 397.5 ▃▃▂▆▇

The data consists of paired observations per diver. So we need to take the difference between the two scores for each diver and then randomly assign each difference a positive or negative sign (“polarity”). Another way to look at this is to shuffle the scores between Semifinal and Final on a per-diver basis.

obs_diff_swim <- mean(~ Final - Semifinal, data = Diving2017)
obs_diff_swim
[1] 11.975
polarity <- c(rep(1, 6), rep(-1, 6))
polarity
 [1]  1  1  1  1  1  1 -1 -1 -1 -1 -1 -1
null_dist_swim <- do(100000) * mean(
  data = Diving2017,
  ~ (Final - Semifinal) * resample(polarity,
    replace = TRUE
  )
)
null_dist_swim %>% head()
gf_histogram(data = null_dist_swim, ~mean) %>%
  gf_vline(xintercept = obs_diff_swim, colour = "red")

It appears that there is no reason to accept the Alternative Hypothesis that the diving scores differ between the semifinal and final events.
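The sign-flipping trick used above generalizes to any matched-pairs design. Here is a minimal base-R sketch with a made-up vector of paired differences (not the Diving2017 data), showing the same null model: under \(H_0\), each difference is equally likely to be positive or negative.

```r
# Matched-pairs permutation test via random sign flips, base R only.
set.seed(2022)
diffs <- c(2.1, -0.4, 1.3, 0.8, -1.6, 2.4, 0.2, 1.1) # made-up paired differences
obs_mean <- mean(diffs)

# Under H0 the pairing carries no information, so each difference's sign
# can be flipped at random; recompute the mean difference each time
null_dist <- replicate(10000, {
  signs <- sample(c(-1, 1), length(diffs), replace = TRUE)
  mean(diffs * signs)
})

# Two-sided p-value: chance of a mean difference at least this extreme
p_value <- mean(abs(null_dist) >= abs(obs_mean))
```

This is exactly what the `polarity` vector with `resample()` accomplishes in the Diving2017 chunk above.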

3.5 Case Study #5: Flight Delays

LaGuardia Airport (LGA) is one of three major airports that serves the New York City metropolitan area. In 2008, over 23 million passengers and over 375,000 planes flew in or out of LGA. United Airlines and American Airlines are two major airlines that schedule services at LGA. The data set FlightDelays contains information on all 4029 departures of these two airlines from LGA during May and June 2009.

flightDelays <- resampledata3::FlightDelays # From resampledata3 package
skim(flightDelays)
Data summary
Name flightDelays
Number of rows 4029
Number of columns 10
_______________________
Column type frequency:
factor 6
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Carrier 0 1 FALSE 2 AA: 2906, UA: 1123
Destination 0 1 FALSE 7 ORD: 1785, DFW: 918, MIA: 610, DEN: 264
DepartTime 0 1 FALSE 5 8-N: 1053, Noo: 1048, 4-8: 972, 4-8: 699
Day 0 1 FALSE 7 Fri: 637, Mon: 630, Tue: 628, Thu: 566
Month 0 1 FALSE 2 Jun: 2030, May: 1999
Delayed30 0 1 FALSE 2 No: 3432, Yes: 597

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 2015.00 1163.22 1 1008 2015 3022 4029 ▇▇▇▇▇
FlightNo 0 1 827.10 551.31 71 371 691 787 2255 ▅▇▁▂▂
FlightLength 0 1 185.30 41.79 68 155 163 228 295 ▁▆▇▅▂
Delay 0 1 11.74 41.63 -19 -6 -3 5 693 ▇▁▁▁▁

The variables in the flightDelays dataset are:

flightDelay dataset variables
Variable Description
Carrier UA=United Airlines, AA=American Airlines
FlightNo Flight number
Destination Airport code
DepartTime Scheduled departure time in 4 h intervals
Day Day of the Week
Month May or June
Delay Minutes flight delayed (negative indicates early departure)
Delayed30 Departure delayed more than 30 min? Yes or No
FlightLength Length of time of flight (minutes)
  1. Let us compute the proportion of times that each carrier’s flights were delayed more than 20 min. We will conduct a two-sided test to see if the difference in these proportions is statistically significant.
prop(data = flightDelays, (Delay >= 20) ~ Carrier)
prop_TRUE.AA prop_TRUE.UA 
   0.1713696    0.2226180 
obs_diff_delay <- diffprop(data = flightDelays, Delay >= 20 ~ Carrier)
obs_diff_delay
  diffprop 
0.05124841 

We see that carrier AA has a 17.13% chance of a delay of 20 min or more, while UA has a 22.26% chance. The difference is 5.12 percentage points. Is this statistically significant? We take the Delays for both Carriers and perform a permutation test by shuffling the Carrier variable:

null_dist_delay <- do(4999) * diffprop(data = flightDelays, Delay >= 20 ~ shuffle(Carrier))
null_dist_delay %>% head()
gf_histogram(data = null_dist_delay, ~diffprop) %>% gf_vline(xintercept = obs_diff_delay, color = "red")

It appears that the difference in delay times is significant. We can compute the p-value based on this test:

2 * mean(null_dist_delay$diffprop >= obs_diff_delay)
[1] 0

which is very small. Hence we reject the null Hypothesis that there is no difference between carriers on delay times.

  2. Compute the variance in the flight delay lengths for each carrier. Conduct a test to see if the variance for United Airlines differs from that of American Airlines.
var(data = flightDelays, Delay ~ Carrier)
      AA       UA 
1606.457 2037.525 
# There is no readymade function in mosaic called `diffvar`...so...we construct one
obs_diff_var <- diff(var(data = flightDelays, Delay ~ Carrier))
obs_diff_var
      UA 
431.0677 

The difference in variances in Delay between the two carriers (UA minus AA) is \(431.0677\). In our Permutation Test, we shuffle the Carrier variable:

null_dist_var <-
  do(4999) * diff(var(data = flightDelays, Delay ~ shuffle(Carrier)))
null_dist_var %>% head()
# The null distribution variable is called `UA`
gf_histogram(data = null_dist_var, ~UA) %>%
  gf_vline(xintercept = obs_diff_var, color = "red")

2 * mean(null_dist_var$UA >= obs_diff_var)
[1] 0.2940588

Clearly there is no case for a significant difference in variances!

3.6 Case Study #6: Walmart vs Target

Is there a difference in the price of groceries sold by the two retailers Target and Walmart? The data set Groceries contains a sample of grocery items and their prices advertised on their respective web sites on one specific day.

  1. Inspect the data set, then explain why this is an example of matched pairs data.
  2. Compute summary statistics of the prices for each store.
  3. Conduct a permutation test to determine whether or not there is a difference in the mean prices.
  4. Create a bar chart of the difference in prices. What is unusual about Quaker Oats Life cereal?
  5. Redo the hypothesis test without this observation. Do you reach the same conclusion?
groceries <- resampledata3::Groceries %>% # From resampledata3 package
  mutate(Product = stringr::str_squish(Product))
groceries
skim(groceries)
Data summary
Name groceries
Number of rows 30
Number of columns 4
_______________________
Column type frequency:
character 1
factor 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Product 0 1 8 56 0 30 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Size 0 1 FALSE 24 18o: 3, 12o: 2, 15o: 2, 16o: 2

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Target 0 1 2.76 1.58 0.99 1.83 2.54 3.14 7.99 ▇▇▁▁▁
Walmart 0 1 2.71 1.56 1.00 1.76 2.34 2.96 6.98 ▇▅▁▁▂

We see that the comparison is to be made between two prices for the same product, and hence this is one more example of paired data, as in Case Study #4. Let us plot the prices for the products:

gf_col(
  data = groceries,
  Target ~ Product,
  fill = "#0073C299",
  width = 0.5
) %>%
  gf_col(
    data = groceries,
    -Walmart ~ Product,
    fill = "#EFC00099",
    ylab = "Prices",
    width = 0.5
  ) %>%
  gf_col(
    data = groceries %>% filter(Product == "Quaker Oats Life Cereal Original"),
    -Walmart ~ Product,
    fill = "red",
    width = 0.5
  ) %>%
  gf_theme(theme_classic()) %>%
  gf_theme(ggplot2::theme(axis.text.x = element_text(
    size = 8,
    face = "bold",
    vjust = 0,
    hjust = 1
  ))) %>%
  gf_theme(ggplot2::coord_flip())
Figure 1: Walmart vs Target Prices

We see that the price difference between Walmart and Target prices is highest for the Product named Quaker Oats Life Cereal Original. Let us check the mean difference in prices:

obs_diff_price <- mean(~ Walmart - Target, data = groceries)
obs_diff_price
[1] -0.05666667

Let us perform the pair-wise permutation test on prices, by randomly flipping the sign of each price difference, just as in the matched-pairs study earlier:

polarity <- c(rep(1, 15), rep(-1, 15))
polarity
 [1]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
[26] -1 -1 -1 -1 -1
null_dist_price <- do(100000) * mean(
  data = groceries,
  ~ (Walmart - Target) * resample(polarity,
    replace = TRUE
  )
)
null_dist_price %>% head()
gf_histogram(data = null_dist_price, ~mean) %>%
  gf_vline(xintercept = obs_diff_price, colour = "red")

2 * mean(null_dist_price$mean <= obs_diff_price) # two-sided P-value

Does not seem to be any significant difference in prices…

Suppose we knock off the Quaker Cereal data item…

which(groceries$Product == "Quaker Oats Life Cereal Original")
[1] 2
groceries_less <- groceries[-2, ]
groceries_less
obs_diff_price_less <- mean(~ Walmart - Target, data = groceries_less)
obs_diff_price_less
[1] -0.1558621
polarity_less <- c(rep(1, 15), rep(-1, 14)) # Due to resampling this small bias makes no difference
null_dist_price_less <- do(100000) * mean(
  data = groceries_less,
  ~ (Walmart - Target) * resample(polarity_less,
    replace = TRUE
  )
)
null_dist_price_less %>% head()
gf_histogram(data = null_dist_price_less, ~mean) %>%
  gf_vline(xintercept = obs_diff_price_less, colour = "red")

1 - mean(null_dist_price_less$mean >= obs_diff_price_less) # P-value
[1] 0.01537

With the Quaker Oats outlier removed, the P-value is about \(0.015\), so the difference in mean prices is now significant: we do not reach the same conclusion as before.

3.7 Case Study 7: Proportions between Categorical Variables

Let us try a dataset with Qualitative / Categorical data. This is a General Social Survey dataset, and we have people with different levels of Education stating their opinion on the Death Penalty. We want to know if these two Categorical variables have a correlation, i.e. can the opinions in favour of the Death Penalty be explained by the Education level?

Since the data is Categorical, we need to tabulate counts and then use a chi-square test. In the test, we will permute the Education variable to see whether its effect on DeathPenalty opinion is significant.
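As a warm-up, the permute-one-column idea for a chi-squared statistic can be sketched in base R with made-up categorical data (not the GSS2002 data, which we analyse below). Here the two variables are independent by construction, so the observed statistic should be unremarkable under the null distribution.

```r
# Permutation chi-squared test for independence, base R only.
set.seed(101)
education <- sample(c("HS", "Bachelors", "Graduate"), 300, replace = TRUE)
opinion   <- sample(c("Favor", "Oppose"), 300, replace = TRUE)

# Observed chi-squared statistic from the contingency table
obs_chi2 <- suppressWarnings(chisq.test(table(education, opinion))$statistic)

# Shuffle one variable to break any association, recompute the statistic
null_dist <- replicate(2000, {
  suppressWarnings(chisq.test(table(sample(education), opinion))$statistic)
})

# Permutation p-value: proportion of shuffles at least as extreme
p_value <- mean(null_dist >= obs_chi2)
```

With the real GSS2002 data below, the same recipe is run with `tally()`, `shuffle()`, and `do()` from mosaic.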

GSS2002 <- resampledata::GSS2002 # From resampledata package

skim(GSS2002)
Data summary
Name GSS2002
Number of rows 2765
Number of columns 21
_______________________
Column type frequency:
factor 20
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Region 0 1.00 FALSE 7 Nor: 684, Sou: 486, Sou: 471, Mid: 435
Gender 0 1.00 FALSE 2 Fem: 1537, Mal: 1228
Race 0 1.00 FALSE 3 Whi: 2188, Bla: 410, Oth: 167
Education 5 1.00 FALSE 5 HS: 1485, Bac: 443, Lef: 400, Gra: 230
Marital 0 1.00 FALSE 5 Mar: 1269, Nev: 708, Div: 445, Wid: 247
Religion 19 0.99 FALSE 13 Pro: 1460, Cat: 673, Non: 379, Chr: 65
Happy 1396 0.50 FALSE 3 Pre: 784, Ver: 415, Not: 170
Income 890 0.68 FALSE 24 400: 170, 300: 166, 250: 140, 500: 136
PolParty 36 0.99 FALSE 8 Ind: 528, Not: 515, Not: 449, Str: 408
Politics 1434 0.48 FALSE 7 Mod: 522, Con: 210, Sli: 209, Sli: 159
Marijuana 1914 0.31 FALSE 2 Not: 545, Leg: 306
DeathPenalty 1457 0.47 FALSE 2 Fav: 899, Opp: 409
OwnGun 1841 0.33 FALSE 3 No: 605, Yes: 310, Ref: 9
GunLaw 1849 0.33 FALSE 2 Fav: 737, Opp: 179
SpendMilitary 1441 0.48 FALSE 3 Abo: 615, Too: 414, Too: 295
SpendEduc 1422 0.49 FALSE 3 Too: 992, Abo: 278, Too: 73
SpendEnv 1443 0.48 FALSE 3 Too: 793, Abo: 439, Too: 90
SpendSci 1499 0.46 FALSE 3 Abo: 629, Too: 461, Too: 176
Pres00 1016 0.63 FALSE 5 Bus: 885, Gor: 781, Nad: 57, Oth: 16
Postlife 1554 0.44 FALSE 2 Yes: 975, No: 236

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 1383 798.33 1 692 1383 2074 2765 ▇▇▇▇▇

Note how all variables (except ID) are Categorical! Education has five levels:

GSS2002 %>% count(Education)
GSS2002 %>% count(DeathPenalty)

Let us drop NA entries in Education and DeathPenalty, and set up a table for the chi-square test.

gss2002 <- GSS2002 %>%
  dplyr::select(Education, DeathPenalty) %>%
  tidyr::drop_na(., c(Education, DeathPenalty))
dim(gss2002)
[1] 1307    2
gss_summary <- gss2002 %>%
  mutate(
    Education = factor(
      Education,
      levels = c("Bachelors", "Graduate", "Jr Col", "HS", "Left HS"),
      labels = c("Bachelors", "Graduate", "Jr Col", "HS", "Left HS")
    ),
    DeathPenalty = as.factor(DeathPenalty)
  ) %>%
  group_by(Education, DeathPenalty) %>%
  summarise(count = n()) %>% # This is good for a chisq test

  # Add two more columns to facilitate a mosaic/Marimekko plot
  #
  mutate(
    edu_count = sum(count),
    edu_prop = count / sum(count)
  ) %>%
  ungroup()

gss_summary
# We can plot a heatmap-like `mosaic chart` for this table, using `ggplot`:
# https://stackoverflow.com/questions/19233365/how-to-create-a-marimekko-mosaic-plot-in-ggplot2

ggplot(data = gss_summary, aes(x = Education, y = edu_prop)) +
  geom_bar(
    aes(width = edu_count, fill = DeathPenalty),
    stat = "identity", position = "fill", colour = "black"
  ) +
  geom_text(aes(label = scales::percent(edu_prop)),
    position = position_stack(vjust = 0.5)
  ) + # if labels are desired
  facet_grid(~Education, scales = "free_x", space = "free_x") +
  scale_fill_brewer(palette = "RdYlGn") +
  # theme(panel.spacing.x = unit(0, "npc")) + # if no spacing preferred between bars
  theme_void()

Let us now perform the standard chi-square test: we need a table of counts, and then stats::chisq.test():

gss_table <- tally(DeathPenalty ~ Education, data = gss2002)
gss_table
            Education
DeathPenalty Left HS  HS Jr Col Bachelors Graduate
      Favor      117 511     71       135       64
      Oppose      72 200     16        71       50
# Get the observed chi-square statistic
observedChi2 <- mosaic::chisq(tally(DeathPenalty ~ Education, data = gss2002))
observedChi2
X.squared 
 23.45093 
# Actual chi-square test
stats::chisq.test(tally(DeathPenalty ~ Education, data = gss2002))

    Pearson's Chi-squared test

data:  tally(DeathPenalty ~ Education, data = gss2002)
X-squared = 23.451, df = 4, p-value = 0.0001029

We should now repeat the test with permutations on Education:

null_chisq <- do(4999) * chisq.test(tally(DeathPenalty ~ shuffle(Education), data = gss2002))

head(null_chisq)
gf_histogram(~X.squared, data = null_chisq) %>%
  gf_vline(xintercept = observedChi2, color = "red")

The observed chi-square statistic of \(23.45\) lies far in the right tail of the permutation null distribution, so we conclude that Education has a significant effect on DeathPenalty opinion!

4 References

  1. Chapter 11, Hypothesis Testing with Randomization, in Introduction to Modern Statistics (1st Ed) by Mine Çetinkaya-Rundel and Johanna Hardin.

  2. Mathematical Statistics with Resampling and R (3rd Ed) by Laura M. Chihara and Tim C. Hesterberg. Wiley, 2022.

R Package Citations

Package Version
ggformula 0.12.2
mosaic 1.9.2
resampledata3 1.0
skimr 2.2.1

License: CC BY-SA 2.0
