The Mad Hatter’s Guide to Data Viz and Stats in R

Inference for Comparing Two Paired Means

A pig don’t get fatter, the more you weigh it!

Author

Arvind V.

Published

November 10, 2022

Modified

September 30, 2025

Abstract
Means from repeated measurements

1 Setting up R Packages

library(tidyverse)
library(mosaic)
library(broom) # Tidy Test data
library(resampledata3) # Datasets from Chihara and Hesterberg's book
library(gt) # for tables
##
library(visStatistics) # All-in-one Stats test package

Plot Fonts and Theme

library(systemfonts)
library(showtext)
library(ggrepel)
library(marquee)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 8
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

2 Introduction to Inference for Paired Data

2.1 What is Paired Data?

Sometimes data are collected on the same set of individuals or entities under two conditions: for example, scores by the same sportspeople in two separate tournaments, sales of identical items at two separate locations of a chain store, or the number of customers in the morning and in the evening at the same set of store locations. In such cases we treat the two sets of observations as paired, since they correspond to the same set of observable entities (identified by a Qualitative variable!). This is how some experiments give us paired data.

We would naturally be interested in the difference in means across these two sets of measurements, computed in a way that exploits their paired nature. In this module, we will examine tests for this purpose.
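As a tiny made-up illustration (the numbers below are invented for this sketch and are not from any dataset used later; the tidyverse packages loaded in Section 1 are assumed), paired data and the per-pair differences we will be testing might look like this:

# Hypothetical paired scores: the same five players in two tournaments
paired_demo <- tibble(
  player = c("A", "B", "C", "D", "E"),
  tournament_1 = c(62, 71, 58, 66, 75),
  tournament_2 = c(65, 70, 63, 69, 74)
) %>%
  mutate(diff = tournament_2 - tournament_1) # per-player change

paired_demo
mean(paired_demo$diff) # the quantity whose population value we will test against zero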

2.2 Workflow for Inference for Paired Means

flowchart TD
    A[Inference for Paired Means] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n Variances: Fisher F-test var.test]
    B --> C{OK?}
    C -->|Yes, both\n Parametric| D[t.test with paired=TRUE]
    C -->|Yes, but not variance\n Parametric| W[t.test with\n Welch Correction, paired =TRUE]
    C -->|No\n Non-Parametric| E[wilcox.test with paired = TRUE]
    C -->|No\n Non-Parametric| P[Bootstrap\n or\n Permutation]
 

We will now use a couple of case studies to traverse all the possible pathways in the Workflow above.
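The decision logic in the flowchart can also be written down as a small helper function. This is only a sketch, not a recipe: the function name choose_paired_test() and the simple check on the normality of the differences are our own simplification of the workflow (the permutation branch is worked by hand in the case studies below).

# A sketch only: the function name and the 5% threshold are our own choices.
# For paired data, the key normality assumption is on the per-pair
# differences d = x - y, so we check that and branch accordingly.
choose_paired_test <- function(x, y, alpha = 0.05) {
  d <- x - y
  if (shapiro.test(d)$p.value > alpha) {
    t.test(x, y, paired = TRUE) # parametric branch
  } else {
    wilcox.test(x, y,
      paired = TRUE, # non-parametric branch
      conf.int = TRUE
    )
  }
}

# e.g. choose_paired_test(Diving2017$Semifinal, Diving2017$Final)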

3 Case Study #1: Results from a Diving Championship

Here we have diving scores from the semi-final and final rounds of a championship:

3.1 Inspecting and Charting Data

data("Diving2017", package = "resampledata3")
Diving2017
Diving2017_inspect <- inspect(Diving2017)
Diving2017_inspect$categorical
Diving2017_inspect$quantitative

The data consist of paired observations per diver: one score for the semi-final and one for the final round. There are 12 divers and therefore 12 paired records. How can we quickly visualize this data?

Let us first make this data into long form:

Diving2017_long <- Diving2017 %>%
  pivot_longer(
    cols = c(Final, Semifinal),
    names_to = "race",
    values_to = "scores"
  )
Diving2017_long

Next, histograms and densities of the two variables at hand:

ggplot2::theme_set(new = theme_custom())

Diving2017_long %>%
  gf_density(~scores,
    fill = ~race,
    alpha = 0.5,
    title = "Diving Scores"
  ) %>%
  gf_refine(scale_fill_brewer(palette = "Set1")) %>%
  gf_facet_grid(~race) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_theme(theme(legend.position = "none")) # No Need!!

ggplot2::theme_set(new = theme_custom())

Diving2017_long %>%
  gf_col(
    fct_reorder(Name, scores) ~ scores,
    fill = ~race,
    alpha = 0.5,
    position = "dodge",
    xlab = "Scores",
    ylab = "Name",
    title = "Diving Scores"
  ) %>%
  gf_refine(scale_fill_brewer(palette = "Set1"))

ggplot2::theme_set(new = theme_custom())

Diving2017_long %>%
  gf_boxplot(
    scores ~ race,
    fill = ~race,
    alpha = 0.5,
    xlab = "Race",
    ylab = "Scores",
    title = "Diving Scores", orientation = "x"
  ) %>%
  gf_refine(scale_fill_brewer(palette = "Set1")) %>%
  gf_theme(theme(legend.position = "none")) # No Need

We see that:

  • The data do not look convincingly normally distributed; with so few readings (n = 12, well under 30) it is hard to judge visually, and more readings would have helped. We will verify this aspect formally very shortly.
  • There is no immediately identifiable trend in score changes from the semi-final to the final.
  • Although the two medians appear to be different, the box plots overlap considerably. So one cannot visually conclude that the two sets of scores have different means.

3.2 A. Check for Normality

Let us also complete a check for normality: the Shapiro-Wilk test (shapiro.test) checks whether a Quant variable comes from a normal distribution; the NULL hypothesis is that the data are from a normal distribution.

shapiro.test(Diving2017$Final)
shapiro.test(Diving2017$Semifinal)

    Shapiro-Wilk normality test

data:  Diving2017$Final
W = 0.9184, p-value = 0.273

    Shapiro-Wilk normality test

data:  Diving2017$Semifinal
W = 0.86554, p-value = 0.05738

Hmmm…the Shapiro-Wilk test does not reject normality for either set of scores (!!!), though for Semifinal the call is marginal (p-value = 0.057).

Can we check this comparison also with plots? We can plot Q-Q plots for both variables, and also compare both data with normally-distributed data generated with the same means and standard deviations:

ggplot2::theme_set(new = theme_custom())

Diving2017_long %>%
  gf_qq(~ scores | race, size = 2, colour = ~race) %>%
  gf_qqline(
    ylab = "scores", colour = ~race,
    xlab = "theoretical normal", title = "Q-Q Plots for Paired Race Data"
  ) %>%
  gf_theme(theme(legend.position = "none"))

We see in the QQ-plots that the Final scores are closer to the straight line than the Semifinal scores. But it is perhaps still hard to accept the data as normally distributed…hmm.

3.3 B. Check for Variances

Let us check if the two variables have similar variances: the var.test does this for us, with the NULL hypothesis that the ratio of the two variances is 1 (i.e. that the variances are equal):

var.test(scores ~ race,
  data = Diving2017_long,
  ratio = 1, # What we believe
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()

The variances are not significantly different, as seen by the \(p.value = 0.08\).

So to summarise our data checks:

Note: Conditions
  • data may be treated as normally distributed
  • variances are not significantly different

3.4 Hypothesis

Based on the graph, how would we formulate our Hypothesis? We wish to infer whether there is any change in performance, for the population of divers who might have participated in these two rounds. So accordingly:

\[ H_0: \mu_{semifinal} = \mu_{final}\\ \]

\[ H_a: \mu_{semifinal} \ne \mu_{final}\ \]

3.5 Observed and Test Statistic

What would be the test statistic we would use? The difference in means. Is the observed difference in the means between the two groups of scores non-zero? We use the diffmean function, with the argument only.2 = FALSE to allow for paired data:

obs_diff_swim <- diffmean(scores ~ race,
  data = Diving2017_long,
  only.2 = FALSE
) # paired data

# Can use this also
# formula method is better for permutation test!
# obs_diff_swim <- mean(~ (Final - Semifinal), data = Diving2017)

obs_diff_swim
diffmean 
 -11.975 

3.6 Inference

  • Paired t.test
  • Paired Wilcoxon test
  • Linear Model Method
  • Permutation Tests

Since the data variables satisfy the assumption of being normally distributed, and the variances are not significantly different, we may attempt the classical t.test with paired data (we will use the mosaic variant; type help(t.test) in your Console). Our model would be:

\[ mean(Final(i) - Semifinal(i)) = \beta_0 \]

And that: \[ H_0: \mu_{final} - \mu_{semifinal} = 0 \] \[ H_a: \mu_{final} - \mu_{semifinal} \ne 0 \]

mosaic::t.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  paired = TRUE, var.equal = FALSE
) %>% broom::tidy()

The confidence interval spans zero, and the p.value is a high \(0.259\), so we have no reason to reject the null hypothesis that the means are equal. Hence we say that there is no evidence of a difference between Semifinal and Final scores.

Suppose we suspect, based on the Q-Q plots or on knowledge of the sport, that at least one of the variables does not meet the normality criterion, even though their variances are not significantly different. We would then attempt a non-parametric Wilcoxon test, which uses the signed ranks of the paired data differences instead of the data values themselves. Our model would be:

\[ mean(\ sign.rank[\ Final(i) - Semifinal(i)\ ]\ ) = \beta_0 \\ \] \[ H_0: \mu_{final} - \mu_{semifinal} = 0; \] \[ H_a: \mu_{final} - \mu_{semifinal} \ne 0;\\ \]

wilcox.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  mu = 0, # belief
  alternative = "two.sided", # difference either way
  paired = TRUE,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()

Here too, with a p.value of \(0.3804\), we have no reason to reject the Null Hypothesis. The parametric t.test and the non-parametric wilcox.test agree in their inferences.

We can apply the linear-model-as-inference interpretation both to the original data and to the sign.rank data: 

\[ lm(y_i - x_i \sim 1) = \beta_0 \]

and

\[ lm(\ sign.rank[\ Final(i) - Semifinal(i)\ ] \sim 1) = \beta_0\\ \]

And the Hypothesis for both interpretations would be:
\[ H_0: \beta_0 = 0;\\ \\\ H_a: \beta_0 \ne 0\\ \]

lm(Semifinal - Final ~ 1, data = Diving2017) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)

# Create a sign-rank function
signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}

lm(signed_rank(Semifinal - Final) ~ 1,
  data = Diving2017
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)

We observe that the linear model method, for both the original scores and the sign-ranked scores, does not permit us to reject the Null Hypothesis \(H_0\): the p.values are high and the confidence intervals straddle \(0\).

For the specific data at hand, we need to shuffle the records between Semifinal and Final on a per-diver basis (paired data!!) and compute the test statistic (the difference between the two scores for each diver). Another way to look at this is to take the differences between Semifinal and Final scores and randomly flip the sign (polarity) of each difference. We will follow this method in the code below:

polarity <- c(rep(1, 6), rep(-1, 6))
# 12 values of +1/-1,
# 6 of each, so that either sign is equally likely
polarity

 [1]  1  1  1  1  1  1 -1 -1 -1 -1 -1 -1

##
null_dist_swim <- do(4999) *
  mean(
    data = Diving2017,
    ~ (Final - Semifinal) * # take (pairwise) differences
      mosaic::resample(polarity, # Swap polarity randomly
        replace = TRUE
      )
  )
##
null_dist_swim

Let us plot the NULL distribution and compare it with the actual observed difference in mean scores:

ggplot2::theme_set(new = theme_custom())

gf_histogram(data = null_dist_swim, ~mean) %>%
  gf_vline(
    xintercept = obs_diff_swim,
    colour = "red",
    linewidth = 1, title = "Null Distribution by Permutation",
    subtitle = "Histogram"
  ) %>%
  gf_labs(x = "Difference in Means") %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
###
gf_ecdf(data = null_dist_swim, ~mean, linewidth = 1) %>%
  gf_vline(
    xintercept = obs_diff_swim,
    colour = "red",
    linewidth = 1, title = "Null Distribution by Permutation",
    subtitle = "Cumulative Density"
  ) %>%
  gf_labs(x = "Difference in Means") %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
###
prop1(~ mean <= obs_diff_swim, data = null_dist_swim)

prop_TRUE 
   0.1346 

Hmm…so across 4999 random re-assignments of the score-difference polarities, we obtain differences as extreme as the observed one (or more extreme) quite frequently. So it does seem that there is no difference in means between Semifinal and Final scores.
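Note that prop1() above reports only the lower-tail proportion (the observed difference is negative). A two-sided permutation p-value can be approximated by doubling it, which lands close to the paired t.test p-value of about 0.26:

# Approximate two-sided permutation p-value: double the one-tail proportion,
# assuming the NULL distribution is roughly symmetric about zero
2 * prop1(~ mean <= obs_diff_swim, data = null_dist_swim)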

3.7 All Tests Together

We can put all the test results together to get a few more insights about the tests:

mosaic::t.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  paired = TRUE
) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "t.test")

lm(Semifinal - Final ~ 1, data = Diving2017) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model")

wilcox.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  paired = TRUE
) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "palegreen"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Wilcoxon test")

lm(signed_rank(Semifinal - Final) ~ 1,
  data = Diving2017
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "palegreen"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model with sign.rank")
t.test

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -11.975 | -1.190339 | 0.2589684 | 11 | -34.11726 | 10.16726 | Paired t-test | two.sided |

Linear Model

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -11.975 | 10.06016 | -1.190339 | 0.2589684 | -34.11726 | 10.16726 |

Wilcoxon test

| statistic | p.value | method | alternative |
|---|---|---|---|
| 27 | 0.3803711 | Wilcoxon signed rank exact test | two.sided |

Linear Model with sign.rank

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -2 | 2.135558 | -0.9365236 | 0.3691097 | -6.70033 | 2.70033 |

The linear model and the t.test are nearly identical in performance; the p.values are the same. The same is also true of the wilcox.test and the linear model with sign-rank data differences. This is of course not surprising!
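Why is this not surprising? The paired t.test is algebraically identical to a one-sample t.test on the per-pair differences, which is exactly what the intercept-only linear model fits. A quick cross-check (compare its output with the first two tables above):

# One-sample t-test on the differences: same estimate, statistic,
# and p.value as the paired t-test and the intercept-only lm above
t.test(Diving2017$Semifinal - Diving2017$Final, mu = 0) %>%
  broom::tidy()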

4 Case Study #2: Walmart vs Target

Is there a difference in the price of Groceries sold by the two retailers Target and Walmart? The data set Groceries contains a sample of grocery items and their prices advertised on their respective web sites on one specific day. We will:

  1. Inspect the data set, then explain why this is an example of matched pairs data.
  2. Compute summary statistics of the prices for each store.
  3. Conduct a permutation test to determine whether or not there is a difference in the mean prices.
  4. Create a bar chart of the difference in prices. What is unusual about Quaker Oats Life cereal?
  5. Redo the hypothesis test without this observation. Would we reach the same conclusion?

4.1 Inspecting and Charting Data

data("Groceries")
Groceries <- Groceries %>%
  mutate(Product = stringr::str_squish(Product)) # Knock off extra spaces
glimpse(Groceries)
Groceries_inspect <- inspect(Groceries)
Groceries_inspect$categorical
Groceries_inspect$quantitative
Rows: 30
Columns: 4
$ Product <chr> "Kellogg NutriGrain Bars", "Quaker Oats Life Cereal Original",…
$ Size    <fct> 8 bars, 18oz, 11.50z, 18oz, 14.3oz , 13oz, 10oz, 21oz, 10 coun…
$ Target  <dbl> 2.50, 3.19, 3.19, 2.82, 2.99, 2.64, 3.99, 4.79, 1.49, 3.49, 1.…
$ Walmart <dbl> 2.78, 6.01, 2.98, 2.68, 2.98, 1.98, 2.50, 4.79, 1.28, 2.98, 1.…

There are just 30 prices for each vendor…just barely enough to get an idea of what the distribution might be. Let us convert the data to tidy long form for convenience,1 and plot the prices for the products as box plots and as bar charts:

Groceries_long <- Groceries %>%
  pivot_longer(
    cols = c(Walmart, Target),
    names_to = "store",
    values_to = "prices"
  ) %>%
  mutate(store = as_factor(store))
Groceries_long

Let us plot histograms/densities of the two variables that we wish to compare. We will also overlay a Gaussian distribution for comparison:

ggplot2::theme_set(new = theme_custom())

Groceries_long %>%
  gf_dhistogram(~prices,
    fill = ~store,
    alpha = 0.5,
    title = "Grocery Costs"
  ) %>%
  gf_facet_grid(~store) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_refine(scale_fill_brewer(palette = "Set1")) %>%
  gf_theme(theme(legend.position = "none"))

ggplot2::theme_set(new = theme_custom())

Groceries_long %>%
  gf_density(~prices,
    fill = ~store,
    alpha = 0.5,
    title = "Grocery Costs"
  ) %>%
  gf_facet_grid(~store) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_refine(scale_fill_brewer(palette = "Set1")) %>%
  gf_theme(theme(legend.position = "none"))

Apart from the fact that prices cannot be negative, the distributions are otherwise not at all close to the Gaussian…there is clearly some skew to the right, with some items being very costly compared to the rest. More on this when we check the assumptions on the data for the tests.

How about the price differences, which are what we are really interested in?

ggplot2::theme_set(new = theme_custom())

Groceries_long %>%
  gf_boxplot(prices ~ store,
    fill = ~store,
    orientation = "x" # as in the earlier box plot
  ) %>%
  gf_refine(scale_fill_brewer(palette = "Set1"))
ggplot2::theme_set(new = theme_custom())

Groceries_long %>%
  gf_col(fct_reorder(Product, prices) ~ prices,
    fill = ~store,
    # alpha = 0.5,
    position = "dodge",
    xlab = "Prices",
    ylab = "",
    title = "Groceries Costs"
  ) %>%
  gf_col(
    data =
      Groceries_long %>%
        filter(
          Product == "Quaker Oats Life Cereal Original"
        ),
    fct_reorder(Product, prices) ~ prices,
    fill = ~store,
    position = "dodge"
  ) %>%
  gf_refine(scale_fill_brewer(name = "Store", palette = "Set1"))

Note

We see that the price difference between the Walmart and Target prices is highest for the Product named Quaker Oats Life Cereal Original. Apart from this Product, the rest show no discernible trend either way. Let us compute the observed statistic (the mean difference in prices):

obs_diff_price <- diffmean(prices ~ store,
  data = Groceries_long,
  only.2 = FALSE
)
# Can also use
# obs_diff_price <-  mean( ~ Walmart - Target, data = Groceries)
obs_diff_price
  diffmean 
0.05666667 

4.2 Hypothesis

Based on the graph, how would we formulate our Hypothesis? We wish to infer whether there is any difference in prices, per product, between the two store chains. So accordingly:

\[ H_0: \mu_{Walmart} = \mu_{Target}\\ \]

\[ H_a: \mu_{Walmart} \ne \mu_{Target}\ \]

4.3 Testing for Assumptions on the Data

There are a few checks we need to make of our data, to decide what test procedure to use.

4.4 A. Check for Normality

shapiro.test(Groceries$Walmart)
shapiro.test(Groceries$Target)

    Shapiro-Wilk normality test

data:  Groceries$Walmart
W = 0.78662, p-value = 3.774e-05

    Shapiro-Wilk normality test

data:  Groceries$Target
W = 0.79722, p-value = 5.836e-05

For both tests, we see that the p.value is very small, indicating that the data are unlikely to be normally distributed. This means we cannot apply a standard paired t.test and need to use the non-parametric wilcox.test, which does not rely on the assumption of normality.
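As a visual cross-check, we can also look at Q-Q plots of the two price variables. This is just a sketch that mirrors the Q-Q plot code from Case Study #1, using the long-form Groceries_long data created above:

ggplot2::theme_set(new = theme_custom())

Groceries_long %>%
  gf_qq(~ prices | store, size = 2, colour = ~store) %>%
  gf_qqline(
    ylab = "prices", colour = ~store,
    xlab = "theoretical normal",
    title = "Q-Q Plots for Paired Price Data"
  ) %>%
  gf_theme(theme(legend.position = "none"))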

4.5 B. Check for Variances

Let us check if the two variables have similar variances:

var.test(Groceries$Walmart, Groceries$Target)

    F test to compare two variances

data:  Groceries$Walmart and Groceries$Target
F = 0.97249, num df = 29, denom df = 29, p-value = 0.9406
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4628695 2.0431908
sample estimates:
ratio of variances 
         0.9724868 

It appears from the \(p.value = 0.94\) and the \(Confidence\ Interval = [0.4629, 2.0432]\), which includes \(1\), that we cannot reject the NULL Hypothesis that the variances are equal. (Remember: this test looks at the ratio of the two variances, and the NULL Hypothesis is that this ratio is \(1\).)

4.6 Inference

  • Using paired t.test
  • Using non-parametric paired Wilcoxon test
  • Using the Linear Model Method
  • Permutation Tests

Well, the variables are not normally distributed, so a standard t.test is not advised, even if the variances are similar. We can still try:

mosaic::t_test(Groceries$Walmart, Groceries$Target, paired = TRUE) %>%
  broom::tidy()

The p.value is \(0.64\)! And the Confidence Interval straddles \(0\). So the t.test gives us no reason to reject the Null Hypothesis that the means are similar. But can we really believe this, given the non-normality of the data?

However, we have seen that the data variables are not normally distributed. So a Wilcoxon Test, using signed-ranks, is indicated: (recall the model!)

# For stability reasons, it may be advisable to use rounded data or to set digits.rank = 7, say,
# such that determination of ties does not depend on very small numeric differences (see the example).

wilcox.test(Groceries$Walmart, Groceries$Target,
  digits.rank = 7, paired = TRUE,
  conf.int = TRUE, conf.level = 0.95
) %>%
  broom::tidy()

The Wilcoxon test result is very interesting: the p.value says there is a significant difference between the two stores' prices, and the confidence interval does not include zero either…

As before, we can fit the linear model to both the original data and the sign.rank data. The test statistic is again the difference between the two variables:

lm(Target - Walmart ~ 1, data = Groceries) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)

# Create a sign-rank function
signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}

lm(signed_rank(Target - Walmart) ~ 1,
  data = Groceries
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)

Very interesting results, confirming what we saw earlier: the linear model with the original data reports no significant difference, but the linear model with sign-ranks suggests there is a significant difference in mean prices between the stores!

Let us perform the pair-wise permutation test on prices, by randomly flipping the sign of each per-product price difference (equivalent to shuffling the two store labels within each pair):

polarity <- c(rep(1, 15), rep(-1, 15))
##
null_dist_price <- do(4999) *
  mean(
    data = Groceries,
    ~ (Target - Walmart) *
      resample(polarity, replace = TRUE)
  )
null_dist_price
ggplot2::theme_set(new = theme_custom())
gf_histogram(data = null_dist_price, ~mean) %>%
  gf_vline(xintercept = obs_diff_price, colour = "red") %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
###
gf_ecdf(data = null_dist_price, ~mean, linewidth = 1) %>%
  gf_vline(
    xintercept = obs_diff_price,
    colour = "red",
    linewidth = 1, title = "Null Distribution by Permutation",
    subtitle = "Cumulative Density"
  ) %>%
  gf_labs(x = "Difference in Means") %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
prop1(~ mean <= obs_diff_price, data = null_dist_price)

prop_TRUE 
   0.6518 

The observed mean difference sits near the middle of the permutation NULL distribution, so there does not seem to be any significant difference in mean prices…

4.7 All Tests Together

We can put all the test results together to get a few more insights about the tests:

mosaic::t_test(Groceries$Walmart, Groceries$Target, paired = TRUE) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "cyan"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "t.test")
###
lm(Target - Walmart ~ 1, data = Groceries) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "cyan"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model")
###
wilcox.test(Groceries$Walmart, Groceries$Target,
  digits.rank = 7,
  paired = TRUE,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "palegreen"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Wilcoxon Test")
###
lm(signed_rank(Target - Walmart) ~ 1,
  data = Groceries
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "palegreen"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model with Sign.Ranks")
t.test

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -0.05666667 | -0.4704556 | 0.6415488 | 29 | -0.3030159 | 0.1896825 | Paired t-test | two.sided |

Linear Model

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.05666667 | 0.1204506 | 0.4704556 | 0.6415488 | -0.1896825 | 0.3030159 |

Wilcoxon Test

| estimate | statistic | p.value | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|
| -0.104966 | 95 | 0.01431746 | -0.1750051 | -0.03005987 | Wilcoxon signed rank test with continuity correction | two.sided |

Linear Model with Sign.Ranks

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 8.533333 | 2.888834 | 2.953902 | 0.006167464 | 2.625004 | 14.44166 |

Clearly, the parametric tests do not detect a significant difference in prices, whereas the non-parametric tests do.
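To see why the two families of tests disagree, it helps to look at the raw per-product price differences themselves. A quick sketch (the head(5) cutoff is arbitrary) puts the Quaker Oats item at the top, with by far the largest difference, and in the opposite direction to most of the others:

# Per-product price differences, largest absolute differences first
Groceries %>%
  mutate(diff = Target - Walmart) %>%
  dplyr::select(Product, Target, Walmart, diff) %>%
  arrange(desc(abs(diff))) %>%
  head(5)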

Suppose we knock off the Quaker Oats Cereal data item…(note that the product name must match exactly; we removed the extra spaces with str_squish() earlier)

Groceries_less <- Groceries %>%
  filter(Product != "Quaker Oats Life Cereal Original")
##
Groceries_less_long <- Groceries_less %>%
  pivot_longer(
    cols = c(Target, Walmart),
    names_to = "store",
    values_to = "prices"
  )
##
wilcox.test(Groceries_less$Walmart,
  Groceries_less$Target,
  paired = TRUE, digits.rank = 7,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()
obs_diff_price_less <-
  mean(~ (Target - Walmart), data = Groceries_less)
obs_diff_price_less
[1] 0.1558621
ggplot2::theme_set(new = theme_custom())

set.seed(12345)
polarity_less <- c(rep(1, 15), rep(-1, 14))
# Due to resampling this small bias makes no difference
null_dist_price_less <-
  do(9999) * mean(
    data = Groceries_less,
    ~ (Target - Walmart) * resample(polarity_less, replace = TRUE)
  )
##
gf_histogram(data = null_dist_price_less, ~mean) %>%
  gf_vline(
    xintercept = obs_diff_price_less,
    colour = "red"
  ) %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
##
###
gf_ecdf(data = null_dist_price_less, ~mean, linewidth = 1) %>%
  gf_vline(
    xintercept = obs_diff_price_less,
    colour = "red",
    linewidth = 1, title = "Null Distribution by Permutation",
    subtitle = "Cumulative Density"
  ) %>%
  gf_labs(x = "Difference in Means") %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
mean(null_dist_price_less >= obs_diff_price_less)

[1] 0.01370137

We see that removing the Quaker Oats product item from the data does give a significant difference in mean prices !!! That one price difference was in the opposite direction compared to the general trend in differences, so when it was removed, we obtained a truer picture of price differences.

Try to do a regular parametric t.test with this reduced data!
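For instance, a sketch of what that might look like, using the Groceries_less data from above (we leave interpreting the output to you):

# A sketch: the same paired t-test, now without the Quaker Oats item
mosaic::t_test(Groceries_less$Walmart, Groceries_less$Target,
  paired = TRUE
) %>%
  broom::tidy()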

5 Wait, But Why?

Paired data is a special case of dependent data, where the observations are paired in some way. This can be due to repeated measures on the same subjects, or matched subjects in a case-control study. The paired t-test and the Wilcoxon signed-rank test are both designed for paired data, and take into account the fact that the observations are not independent.
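A tiny simulation (invented numbers, purely for illustration) shows why honouring the pairing matters: when both measurements share a large per-subject component, the paired test works with the small within-subject differences, while an unpaired test must treat the between-subject variation as noise.

set.seed(2022)
n <- 20
baseline <- rnorm(n, mean = 50, sd = 10) # large subject-to-subject variation
before <- baseline + rnorm(n, sd = 2)
after <- baseline + 3 + rnorm(n, sd = 2) # a true shift of +3 per subject

# Pairing removes the baseline noise, so the shift is easy to detect...
t.test(after, before, paired = TRUE)$p.value
# ...while the unpaired test, swamped by between-subject noise,
# is typically far less sensitive
t.test(after, before, paired = FALSE)$p.value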

Can you think of an underlying experiment where the data may be paired?

6 Conclusion

We have learnt how to perform inference for paired means. We have looked at the conditions that make the regular t.test possible, and learnt what to do if the conditions of normality and equal variance are not met. We have also looked at how these tests can be understood as manifestations of the linear model, with the original data and with sign-ranked data. It should also be fairly clear now that we can test for the equivalence of two paired means using a very simple permutation test. Given computing power, we can always mechanize this test very quickly to get our results, and it yields reliable results without having to rely on assumptions about the underlying distributions.

7 Your Turn

  1. Try the datasets in the openintro package. Use data(package = "openintro") in your Console to list the datasets in that package. Then simply type the name of a dataset in a Quarto chunk (e.g. babynames) to read it.

  2. Same with the resampledata and resampledata3 packages.

  3. Try the datasets in the PairedData package. Yes, install this too, peasants.

8 References

  1. Paired Independence test with the package infer: https://infer.netlify.app/articles/paired
  2. Randall Pruim, Nicholas J. Horton, and Daniel T. Kaplan, Start Teaching with R
  3. https://bcs.wiley.com/he-bcs/Books?action=index&itemId=111941654X&bcsId=11307
  4. https://statsandr.com/blog/wilcoxon-test-in-r-how-to-compare-2-groups-under-the-non-normality-assumption/
R Package Citations

| Package | Version | Citation |
|---|---|---|
| gt | 1.1.0 | @gt |
| infer | 1.0.9 | @infer |
| MKinfer | 1.2 | @MKinfer |
| openintro | 2.5.0 | @openintro |

Footnotes

  1. https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/tidyr-pivoting.gif↩︎


License: CC BY-SA 2.0
