The Mad Hatter’s Guide to Data Viz and Stats in R

Inference for Comparing Two Paired Means

A pig don’t get fatter, the more you weigh it!

Author

Arvind V.

Published

November 10, 2022

Modified

September 30, 2025

Abstract
Means from repeated measurements

1 Setting up R Packages

library(tidyverse)
library(mosaic)
library(broom) # Tidy Test data
library(resampledata3) # Datasets from Chihara and Hesterberg's book
library(gt) # for tables
##
library(visStatistics) # All-in-one Stats test package

Plot Fonts and Theme

library(systemfonts)
library(showtext)
library(ggrepel)
library(marquee)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 8
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

2 Introduction to Inference for Paired Data

2.1 What is Paired Data?

Sometimes data are collected on the same set of individuals or entities under two conditions: for example, scores by the same sportspeople in two separate tournaments, sales of identical items at two separate locations of a chain store, or the number of customers in the morning and in the evening at the same set of store locations. In such cases we treat the two sets of observations as paired, since they correspond to the same set of observable entities (identified by a Qualitative variable!). This is how some experiments give us paired data.

We would naturally be interested in the difference in means across these two sets of measurements, computed in a way that exploits their paired nature. In this module, we will examine tests for this purpose.
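As a tiny made-up illustration (the numbers below are invented for this sketch and are not from any dataset used later; the tidyverse packages loaded in Section 1 are assumed), paired data and the per-pair differences we will be testing might look like this:

# Hypothetical paired scores: the same five players in two tournaments
paired_demo <- tibble(
  player = c("A", "B", "C", "D", "E"),
  tournament_1 = c(62, 71, 58, 66, 75),
  tournament_2 = c(65, 70, 63, 69, 74)
) %>%
  mutate(diff = tournament_2 - tournament_1) # per-player change

paired_demo
mean(paired_demo$diff) # the quantity whose population value we will test against zero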

2.2 Workflow for Inference for Paired Means

flowchart TD
    A[Inference for Paired Means] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n Variances: Fisher F-test var.test]
    B --> C{OK?}
    C -->|Yes, both\n Parametric| D[t.test with paired=TRUE]
    C -->|Yes, but not variance\n Parametric| W[t.test with\n Welch Correction, paired =TRUE]
    C -->|No\n Non-Parametric| E[wilcox.test with paired = TRUE]
    C -->|No\n Non-Parametric| P[Bootstrap\n or\n Permutation]
 

We will now use a couple of case studies to traverse all the possible pathways in the Workflow above.
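The decision logic in the flowchart can also be written down as a small helper function. This is only a sketch, not a recipe: the function name choose_paired_test() and the simple check on the normality of the differences are our own simplification of the workflow (the permutation branch is worked by hand in the case studies below).

# A sketch only: the function name and the 5% threshold are our own choices.
# For paired data, the key normality assumption is on the per-pair
# differences d = x - y, so we check that and branch accordingly.
choose_paired_test <- function(x, y, alpha = 0.05) {
  d <- x - y
  if (shapiro.test(d)$p.value > alpha) {
    t.test(x, y, paired = TRUE) # parametric branch
  } else {
    wilcox.test(x, y,
      paired = TRUE, # non-parametric branch
      conf.int = TRUE
    )
  }
}

# e.g. choose_paired_test(Diving2017$Semifinal, Diving2017$Final)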

3 Case Study #1: Results from a Diving Championship

Here we have diving scores from the semi-final and final rounds of a championship:

3.1 Inspecting and Charting Data

data("Diving2017", package = "resampledata3")
Diving2017
Diving2017_inspect <- inspect(Diving2017)
Diving2017_inspect$categorical
Diving2017_inspect$quantitative

The data consist of paired observations per diver: one score for the semi-final and one for the final round. There are 12 divers and therefore 12 paired records. How can we quickly visualize this data?

Let us first make this data into long form:

Diving2017_long <- Diving2017 %>%
  pivot_longer(
    cols = c(Final, Semifinal),
    names_to = "race",
    values_to = "scores"
  )
Diving2017_long

Next, histograms and densities of the two variables at hand:

ggplot2::theme_set(new = theme_custom())

Diving2017_long %>%
  gf_density(~scores,
    fill = ~race,
    alpha = 0.5,
    title = "Diving Scores"
  ) %>%
  gf_refine(scale_fill_brewer(palette = "Set1")) %>%
  gf_facet_grid(~race) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_theme(theme(legend.position = "none")) # No Need!!

ggplot2::theme_set(new = theme_custom())

Diving2017_long %>%
  gf_col(
    fct_reorder(Name, scores) ~ scores,
    fill = ~race,
    alpha = 0.5,
    position = "dodge",
    xlab = "Scores",
    ylab = "Name",
    title = "Diving Scores"
  ) %>%
  gf_refine(scale_fill_brewer(palette = "Set1"))

ggplot2::theme_set(new = theme_custom())

Diving2017_long %>%
  gf_boxplot(
    scores ~ race,
    fill = ~race,
    alpha = 0.5,
    xlab = "Race",
    ylab = "Scores",
    title = "Diving Scores", orientation = "x"
  ) %>%
  gf_refine(scale_fill_brewer(palette = "Set1")) %>%
  gf_theme(theme(legend.position = "none")) # No Need

We see that:

  • The data do not look convincingly normally distributed; with so few readings (n = 12, well under 30) it is hard to judge visually, and more readings would have helped. We will verify this aspect formally very shortly.
  • There is no immediately identifiable trend in score changes from the semi-final to the final.
  • Although the two medians appear to be different, the box plots overlap considerably. So one cannot visually conclude that the two sets of scores have different means.

3.2 A. Check for Normality

Let us also complete a check for normality: the Shapiro-Wilk test (shapiro.test) checks whether a Quant variable comes from a normal distribution; the NULL hypothesis is that the data are from a normal distribution.

shapiro.test(Diving2017$Final)
shapiro.test(Diving2017$Semifinal)

    Shapiro-Wilk normality test

data:  Diving2017$Final
W = 0.9184, p-value = 0.273

    Shapiro-Wilk normality test

data:  Diving2017$Semifinal
W = 0.86554, p-value = 0.05738

Hmmm…the Shapiro-Wilk test does not reject normality for either set of scores (!!!), though for Semifinal the call is marginal (p-value = 0.057).

Can we check this comparison also with plots? We can plot Q-Q plots for both variables, and also compare both data with normally-distributed data generated with the same means and standard deviations:

ggplot2::theme_set(new = theme_custom())

Diving2017_long %>%
  gf_qq(~ scores | race, size = 2, colour = ~race) %>%
  gf_qqline(
    ylab = "scores", colour = ~race,
    xlab = "theoretical normal", title = "Q-Q Plots for Paired Race Data"
  ) %>%
  gf_theme(theme(legend.position = "none"))

We see in the QQ-plots that the Final scores are closer to the straight line than the Semifinal scores. But it is perhaps still hard to accept the data as normally distributed…hmm.

3.3 B. Check for Variances

Let us check if the two variables have similar variances: the var.test does this for us, with the NULL hypothesis that the ratio of the two variances is 1 (i.e. that the variances are equal):

var.test(scores ~ race,
  data = Diving2017_long,
  ratio = 1, # What we believe
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()

The variances are not significantly different, as seen by the \(p.value = 0.08\).

So to summarise our data checks:

Note: Conditions
  • data may be treated as normally distributed
  • variances are not significantly different

3.4 Hypothesis

Based on the graph, how would we formulate our Hypothesis? We wish to infer whether there is any change in performance, for the population of divers who might have participated in these two rounds. So accordingly:

\[ H_0: \mu_{semifinal} = \mu_{final}\\ \]

\[ H_a: \mu_{semifinal} \ne \mu_{final}\ \]

3.5 Observed and Test Statistic

What would be the test statistic we would use? The difference in means. Is the observed difference in the means between the two groups of scores non-zero? We use the diffmean function, with the argument only.2 = FALSE to allow for paired data:

obs_diff_swim <- diffmean(scores ~ race,
  data = Diving2017_long,
  only.2 = FALSE
) # paired data

# Can use this also
# formula method is better for permutation test!
# obs_diff_swim <- mean(~ (Final - Semifinal), data = Diving2017)

obs_diff_swim
diffmean 
 -11.975 

3.6 Inference

  • Paired t.test
  • Paired Wilcoxon test
  • Linear Model Method
  • Permutation Tests

Since the data variables satisfy the assumption of being normally distributed, and the variances are not significantly different, we may attempt the classical t.test with paired data (we will use the mosaic variant; type help(t.test) in your Console). Our model would be:

\[ mean(Final(i) - Semifinal(i)) = \beta_0 \]

And that: \[ H_0: \mu_{final} - \mu_{semifinal} = 0 \] \[ H_a: \mu_{final} - \mu_{semifinal} \ne 0 \]

mosaic::t.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  paired = TRUE, var.equal = FALSE
) %>% broom::tidy()

The confidence interval spans zero, and the p.value is a high \(0.259\), so we have no reason to reject the null hypothesis that the means are equal. Hence we say that there is no evidence of a difference between Semifinal and Final scores.

Suppose we suspect, based on the Q-Q plots or on knowledge of the sport, that at least one of the variables does not meet the normality criterion, even though their variances are not significantly different. We would then attempt a non-parametric Wilcoxon test, which uses the signed ranks of the paired data differences instead of the data values themselves. Our model would be:

\[ mean(\ sign.rank[\ Final(i) - Semifinal(i)\ ]\ ) = \beta_0 \\ \] \[ H_0: \mu_{final} - \mu_{semifinal} = 0; \] \[ H_a: \mu_{final} - \mu_{semifinal} \ne 0;\\ \]

wilcox.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  mu = 0, # belief
  alternative = "two.sided", # difference either way
  paired = TRUE,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()

Here too, with a p.value of \(0.3804\), we have no reason to reject the Null Hypothesis. The parametric t.test and the non-parametric wilcox.test agree in their inferences.

We can apply the linear-model-as-inference interpretation both to the original data and to the sign.rank data: 

\[ lm(y_i - x_i \sim 1) = \beta_0 \]

and

\[ lm(\ sign.rank[\ Final(i) - Semifinal(i)\ ] \sim 1) = \beta_0\\ \]

And the Hypothesis for both interpretations would be:
\[ H_0: \beta_0 = 0;\\ \\\ H_a: \beta_0 \ne 0\\ \]

lm(Semifinal - Final ~ 1, data = Diving2017) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)

# Create a sign-rank function
signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}

lm(signed_rank(Semifinal - Final) ~ 1,
  data = Diving2017
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)

We observe that the linear model method, for both the original scores and the sign-ranked scores, does not permit us to reject the Null Hypothesis \(H_0\): the p.values are high and the confidence intervals straddle \(0\).

For the specific data at hand, we need to shuffle the records between Semifinal and Final on a per-diver basis (paired data!!) and compute the test statistic (the difference between the two scores for each diver). Another way to look at this is to take the differences between Semifinal and Final scores and randomly flip the sign (polarity) of each difference. We will follow this method in the code below:

polarity <- c(rep(1, 6), rep(-1, 6))
# 12 values of +1/-1,
# 6 of each, so that either sign is equally likely
polarity

 [1]  1  1  1  1  1  1 -1 -1 -1 -1 -1 -1

##
null_dist_swim <- do(4999) *
  mean(
    data = Diving2017,
    ~ (Final - Semifinal) * # take (pairwise) differences
      mosaic::resample(polarity, # Swap polarity randomly
        replace = TRUE
      )
  )
##
null_dist_swim

Let us plot the NULL distribution and compare it with the actual observed difference in mean scores:

ggplot2::theme_set(new = theme_custom())

gf_histogram(data = null_dist_swim, ~mean) %>%
  gf_vline(
    xintercept = obs_diff_swim,
    colour = "red",
    linewidth = 1, title = "Null Distribution by Permutation",
    subtitle = "Histogram"
  ) %>%
  gf_labs(x = "Difference in Means") %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
###
gf_ecdf(data = null_dist_swim, ~mean, linewidth = 1) %>%
  gf_vline(
    xintercept = obs_diff_swim,
    colour = "red",
    linewidth = 1, title = "Null Distribution by Permutation",
    subtitle = "Cumulative Density"
  ) %>%
  gf_labs(x = "Difference in Means") %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
###
prop1(~ mean <= obs_diff_swim, data = null_dist_swim)

prop_TRUE 
   0.1346 

Hmm…so across 4999 random re-assignments of the score-difference polarities, we obtain differences as extreme as the observed one (or more extreme) quite frequently. So it does seem that there is no difference in means between Semifinal and Final scores.
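Note that prop1() above reports only the lower-tail proportion (the observed difference is negative). A two-sided permutation p-value can be approximated by doubling it, which lands close to the paired t.test p-value of about 0.26:

# Approximate two-sided permutation p-value: double the one-tail proportion,
# assuming the NULL distribution is roughly symmetric about zero
2 * prop1(~ mean <= obs_diff_swim, data = null_dist_swim)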

3.7 All Tests Together

We can put all the test results together to get a few more insights about the tests:

mosaic::t.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  paired = TRUE
) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "t.test")

lm(Semifinal - Final ~ 1, data = Diving2017) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model")

wilcox.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  paired = TRUE
) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "palegreen"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Wilcoxon test")

lm(signed_rank(Semifinal - Final) ~ 1,
  data = Diving2017
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "palegreen"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model with sign.rank")
t.test

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -11.975 | -1.190339 | 0.2589684 | 11 | -34.11726 | 10.16726 | Paired t-test | two.sided |

Linear Model

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -11.975 | 10.06016 | -1.190339 | 0.2589684 | -34.11726 | 10.16726 |

Wilcoxon test

| statistic | p.value | method | alternative |
|---|---|---|---|
| 27 | 0.3803711 | Wilcoxon signed rank exact test | two.sided |

Linear Model with sign.rank

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -2 | 2.135558 | -0.9365236 | 0.3691097 | -6.70033 | 2.70033 |

The linear model and the t.test are nearly identical in performance; the p.values are the same. The same is also true of the wilcox.test and the linear model with sign-rank data differences. This is of course not surprising!
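Why is this not surprising? The paired t.test is algebraically identical to a one-sample t.test on the per-pair differences, which is exactly what the intercept-only linear model fits. A quick cross-check (compare its output with the first two tables above):

# One-sample t-test on the differences: same estimate, statistic,
# and p.value as the paired t-test and the intercept-only lm above
t.test(Diving2017$Semifinal - Diving2017$Final, mu = 0) %>%
  broom::tidy()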

4 Case Study #2: Walmart vs Target

Is there a difference in the price of Groceries sold by the two retailers Target and Walmart? The data set Groceries contains a sample of grocery items and their prices advertised on their respective web sites on one specific day. We will:

  1. Inspect the data set, then explain why this is an example of matched pairs data.
  2. Compute summary statistics of the prices for each store.
  3. Conduct a permutation test to determine whether or not there is a difference in the mean prices.
  4. Create a bar chart of the difference in prices. What is unusual about Quaker Oats Life cereal?
  5. Redo the hypothesis test without this observation. Would we reach the same conclusion?

4.1 Inspecting and Charting Data

data("Groceries")
Groceries <- Groceries %>%
  mutate(Product = stringr::str_squish(Product)) # Knock off extra spaces
glimpse(Groceries)
Groceries_inspect <- inspect(Groceries)
Groceries_inspect$categorical
Groceries_inspect$quantitative
Rows: 30
Columns: 4
$ Product <chr> "Kellogg NutriGrain Bars", "Quaker Oats Life Cereal Original",…
$ Size    <fct> 8 bars, 18oz, 11.50z, 18oz, 14.3oz , 13oz, 10oz, 21oz, 10 coun…
$ Target  <dbl> 2.50, 3.19, 3.19, 2.82, 2.99, 2.64, 3.99, 4.79, 1.49, 3.49, 1.…
$ Walmart <dbl> 2.78, 6.01, 2.98, 2.68, 2.98, 1.98, 2.50, 4.79, 1.28, 2.98, 1.…

There are just 30 prices for each vendor…just barely enough to get an idea of what the distribution might be. Let us convert the data to tidy long form for convenience,1 and plot the prices for the products as box plots and as bar charts:

Groceries_long <- Groceries %>%
  pivot_longer(
    cols = c(Walmart, Target),
    names_to = "store",
    values_to = "prices"
  ) %>%
  mutate(store = as_factor(store))
Groceries_long

Let us plot histograms/densities of the two variables that we wish to compare. We will also overlay a Gaussian distribution for comparison:

ggplot2::theme_set(new = theme_custom())

Groceries_long %>%
  gf_dhistogram(~prices,
    fill = ~store,
    alpha = 0.5,
    title = "Grocery Costs"
  ) %>%
  gf_facet_grid(~store) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_refine(scale_fill_brewer(palette = "Set1")) %>%
  gf_theme(theme(legend.position = "none"))

ggplot2::theme_set(new = theme_custom())

Groceries_long %>%
  gf_density(~prices,
    fill = ~store,
    alpha = 0.5,
    title = "Grocery Costs"
  ) %>%
  gf_facet_grid(~store) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_refine(scale_fill_brewer(palette = "Set1")) %>%
  gf_theme(theme(legend.position = "none"))

Apart from the fact that prices cannot be negative, the distributions are otherwise not at all close to the Gaussian…there is clearly some skew to the right, with some items being very costly compared to the rest. More on this when we check the assumptions on the data for the tests.

How about the price differences, which are what we are really interested in?

ggplot2::theme_set(new = theme_custom())

Groceries_long %>%
  gf_boxplot(prices ~ store,
    fill = ~store,
    orientation = "x" # as in the earlier box plot
  ) %>%
  gf_refine(scale_fill_brewer(palette = "Set1"))
ggplot2::theme_set(new = theme_custom())

Groceries_long %>%
  gf_col(fct_reorder(Product, prices) ~ prices,
    fill = ~store,
    # alpha = 0.5,
    position = "dodge",
    xlab = "Prices",
    ylab = "",
    title = "Groceries Costs"
  ) %>%
  gf_col(
    data =
      Groceries_long %>%
        filter(
          Product == "Quaker Oats Life Cereal Original"
        ),
    fct_reorder(Product, prices) ~ prices,
    fill = ~store,
    position = "dodge"
  ) %>%
  gf_refine(scale_fill_brewer(name = "Store", palette = "Set1"))

Note

We see that the price difference between the Walmart and Target prices is highest for the Product named Quaker Oats Life Cereal Original. Apart from this Product, the rest show no discernible trend either way. Let us compute the observed statistic (the mean difference in prices):

obs_diff_price <- diffmean(prices ~ store,
  data = Groceries_long,
  only.2 = FALSE
)
# Can also use
# obs_diff_price <-  mean( ~ Walmart - Target, data = Groceries)
obs_diff_price
  diffmean 
0.05666667 

4.2 Hypothesis

Based on the graph, how would we formulate our Hypothesis? We wish to infer whether there is any difference in prices, per product, between the two store chains. So accordingly:

\[ H_0: \mu_{Walmart} = \mu_{Target}\\ \]

\[ H_a: \mu_{Walmart} \ne \mu_{Target}\ \]

4.3 Testing for Assumptions on the Data

There are a few checks we need to make of our data, to decide what test procedure to use.

4.4 A. Check for Normality

shapiro.test(Groceries$Walmart)
shapiro.test(Groceries$Target)

    Shapiro-Wilk normality test

data:  Groceries$Walmart
W = 0.78662, p-value = 3.774e-05

    Shapiro-Wilk normality test

data:  Groceries$Target
W = 0.79722, p-value = 5.836e-05

For both tests, we see that the p.value is very small, indicating that the data are unlikely to be normally distributed. This means we cannot apply a standard paired t.test and need to use the non-parametric wilcox.test, which does not rely on the assumption of normality.
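As a visual cross-check, we can also look at Q-Q plots of the two price variables. This is just a sketch that mirrors the Q-Q plot code from Case Study #1, using the long-form Groceries_long data created above:

ggplot2::theme_set(new = theme_custom())

Groceries_long %>%
  gf_qq(~ prices | store, size = 2, colour = ~store) %>%
  gf_qqline(
    ylab = "prices", colour = ~store,
    xlab = "theoretical normal",
    title = "Q-Q Plots for Paired Price Data"
  ) %>%
  gf_theme(theme(legend.position = "none"))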

4.5 B. Check for Variances

Let us check if the two variables have similar variances:

var.test(Groceries$Walmart, Groceries$Target)

    F test to compare two variances

data:  Groceries$Walmart and Groceries$Target
F = 0.97249, num df = 29, denom df = 29, p-value = 0.9406
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4628695 2.0431908
sample estimates:
ratio of variances 
         0.9724868 

It appears from the \(p.value = 0.94\) and the \(Confidence\ Interval = [0.4629, 2.0432]\), which includes \(1\), that we cannot reject the NULL Hypothesis that the variances are equal. (Remember: this test looks at the ratio of the two variances, and the NULL Hypothesis is that this ratio is \(1\).)

4.6 Inference

  • Using paired t.test
  • Using non-parametric paired Wilcoxon test
  • Using the Linear Model Method
  • Permutation Tests

Well, the variables are not normally distributed, so a standard t.test is not advised, even if the variances are similar. We can still try:

mosaic::t_test(Groceries$Walmart, Groceries$Target, paired = TRUE) %>%
  broom::tidy()

The p.value is \(0.64\)! And the Confidence Interval straddles \(0\). So the t.test gives us no reason to reject the Null Hypothesis that the means are similar. But can we really believe this, given the non-normality of the data?

However, we have seen that the data variables are not normally distributed. So a Wilcoxon Test, using signed-ranks, is indicated: (recall the model!)

# For stability reasons, it may be advisable to use rounded data or to set digits.rank = 7, say,
# such that determination of ties does not depend on very small numeric differences (see the example).

wilcox.test(Groceries$Walmart, Groceries$Target,
  digits.rank = 7, paired = TRUE,
  conf.int = TRUE, conf.level = 0.95
) %>%
  broom::tidy()

The Wilcoxon test result is very interesting: the p.value says there is a significant difference between the two stores' prices, and the confidence interval does not include zero either…

As before, we can fit the linear model to both the original data and the sign.rank data. The test statistic is again the difference between the two variables:

lm(Target - Walmart ~ 1, data = Groceries) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)

# Create a sign-rank function
signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}

lm(signed_rank(Target - Walmart) ~ 1,
  data = Groceries
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)

Very interesting results, confirming what we saw earlier: the linear model with the original data reports no significant difference, but the linear model with sign-ranks suggests there is a significant difference in mean prices between the stores!

Let us perform the pair-wise permutation test on prices, by randomly flipping the sign of each per-product price difference (equivalent to shuffling the two store labels within each pair):

polarity <- c(rep(1, 15), rep(-1, 15))
##
null_dist_price <- do(4999) *
  mean(
    data = Groceries,
    ~ (Target - Walmart) *
      resample(polarity, replace = TRUE)
  )
null_dist_price
ggplot2::theme_set(new = theme_custom())
gf_histogram(data = null_dist_price, ~mean) %>%
  gf_vline(xintercept = obs_diff_price, colour = "red") %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
###
gf_ecdf(data = null_dist_price, ~mean, linewidth = 1) %>%
  gf_vline(
    xintercept = obs_diff_price,
    colour = "red",
    linewidth = 1, title = "Null Distribution by Permutation",
    subtitle = "Cumulative Density"
  ) %>%
  gf_labs(x = "Difference in Means") %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
prop1(~ mean <= obs_diff_price, data = null_dist_price)

prop_TRUE 
   0.6518 

The observed mean difference sits near the middle of the permutation NULL distribution, so there does not seem to be any significant difference in mean prices…

4.7 All Tests Together

We can put all the test results together to get a few more insights about the tests:

mosaic::t_test(Groceries$Walmart, Groceries$Target, paired = TRUE) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "cyan"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "t.test")
###
lm(Target - Walmart ~ 1, data = Groceries) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "cyan"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model")
###
wilcox.test(Groceries$Walmart, Groceries$Target,
  digits.rank = 7,
  paired = TRUE,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "palegreen"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Wilcoxon Test")
###
lm(signed_rank(Target - Walmart) ~ 1,
  data = Groceries
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "palegreen"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model with Sign.Ranks")
t.test

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -0.05666667 | -0.4704556 | 0.6415488 | 29 | -0.3030159 | 0.1896825 | Paired t-test | two.sided |

Linear Model

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.05666667 | 0.1204506 | 0.4704556 | 0.6415488 | -0.1896825 | 0.3030159 |

Wilcoxon Test

| estimate | statistic | p.value | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|
| -0.104966 | 95 | 0.01431746 | -0.1750051 | -0.03005987 | Wilcoxon signed rank test with continuity correction | two.sided |

Linear Model with Sign.Ranks

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 8.533333 | 2.888834 | 2.953902 | 0.006167464 | 2.625004 | 14.44166 |

Clearly, the parametric tests do not detect a significant difference in prices, whereas the non-parametric tests do.
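To see why the two families of tests disagree, it helps to look at the raw per-product price differences themselves. A quick sketch (the head(5) cutoff is arbitrary) puts the Quaker Oats item at the top, with by far the largest difference, and in the opposite direction to most of the others:

# Per-product price differences, largest absolute differences first
Groceries %>%
  mutate(diff = Target - Walmart) %>%
  dplyr::select(Product, Target, Walmart, diff) %>%
  arrange(desc(abs(diff))) %>%
  head(5)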

Suppose we knock off the Quaker Oats Cereal data item…(note that the product name must match exactly; we removed the extra spaces with str_squish() earlier)

Groceries_less <- Groceries %>%
  filter(Product != "Quaker Oats Life Cereal Original")
##
Groceries_less_long <- Groceries_less %>%
  pivot_longer(
    cols = c(Target, Walmart),
    names_to = "store",
    values_to = "prices"
  )
##
wilcox.test(Groceries_less$Walmart,
  Groceries_less$Target,
  paired = TRUE, digits.rank = 7,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()
obs_diff_price_less <-
  mean(~ (Target - Walmart), data = Groceries_less)
obs_diff_price_less
[1] 0.1558621
ggplot2::theme_set(new = theme_custom())

set.seed(12345)
polarity_less <- c(rep(1, 15), rep(-1, 14))
# Due to resampling this small bias makes no difference
null_dist_price_less <-
  do(9999) * mean(
    data = Groceries_less,
    ~ (Target - Walmart) * resample(polarity_less, replace = TRUE)
  )
##
gf_histogram(data = null_dist_price_less, ~mean) %>%
  gf_vline(
    xintercept = obs_diff_price_less,
    colour = "red"
  ) %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
##
###
gf_ecdf(data = null_dist_price_less, ~mean, linewidth = 1) %>%
  gf_vline(
    xintercept = obs_diff_price_less,
    colour = "red",
    linewidth = 1, title = "Null Distribution by Permutation",
    subtitle = "Cumulative Density"
  ) %>%
  gf_labs(x = "Difference in Means") %>%
  gf_refine(scale_y_continuous(expand = c(0, 0)))
mean(null_dist_price_less >= obs_diff_price_less)

[1] 0.01370137

We see that removing the Quaker Oats product item from the data does give a significant difference in mean prices !!! That one price difference was in the opposite direction compared to the general trend in differences, so when it was removed, we obtained a truer picture of price differences.

Try to do a regular parametric t.test with this reduced data!
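For instance, a sketch of what that might look like, using the Groceries_less data from above (we leave interpreting the output to you):

# A sketch: the same paired t-test, now without the Quaker Oats item
mosaic::t_test(Groceries_less$Walmart, Groceries_less$Target,
  paired = TRUE
) %>%
  broom::tidy()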

5 Wait, But Why?

Paired data is a special case of dependent data, where the observations are paired in some way. This can be due to repeated measures on the same subjects, or matched subjects in a case-control study. The paired t-test and the Wilcoxon signed-rank test are both designed for paired data, and take into account the fact that the observations are not independent.
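A tiny simulation (invented numbers, purely for illustration) shows why honouring the pairing matters: when both measurements share a large per-subject component, the paired test works with the small within-subject differences, while an unpaired test must treat the between-subject variation as noise.

set.seed(2022)
n <- 20
baseline <- rnorm(n, mean = 50, sd = 10) # large subject-to-subject variation
before <- baseline + rnorm(n, sd = 2)
after <- baseline + 3 + rnorm(n, sd = 2) # a true shift of +3 per subject

# Pairing removes the baseline noise, so the shift is easy to detect...
t.test(after, before, paired = TRUE)$p.value
# ...while the unpaired test, swamped by between-subject noise,
# is typically far less sensitive
t.test(after, before, paired = FALSE)$p.value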

Can you think of an underlying experiment where the data may be paired?

6 Conclusion

We have learnt how to perform inference for paired means. We have looked at the conditions that make the regular t.test possible, and learnt what to do if the conditions of normality and equal variance are not met. We have also looked at how these tests can be understood as manifestations of the linear model, with the original data and with sign-ranked data. It should also be fairly clear now that we can test for the equivalence of two paired means using a very simple permutation test. Given computing power, we can always mechanize this test very quickly to get our results, and it yields reliable results without having to rely on assumptions about the underlying distributions.

7 Your Turn

  1. Try the datasets in the openintro package. Use data(package = "openintro") in your Console to list the datasets in that package. Then simply type the name of a dataset in a Quarto chunk (e.g. babynames) to read it.

  2. Same with the resampledata and resampledata3 packages.

  3. Try the datasets in the PairedData package. Yes, install this too, peasants.

8 References

  1. Paired Independence test with the package infer: https://infer.netlify.app/articles/paired
  2. Randall Pruim, Nicholas J. Horton, and Daniel T. Kaplan, Start Teaching with R
  3. https://bcs.wiley.com/he-bcs/Books?action=index&itemId=111941654X&bcsId=11307
  4. https://statsandr.com/blog/wilcoxon-test-in-r-how-to-compare-2-groups-under-the-non-normality-assumption/
R Package Citations

| Package | Version | Citation |
|---|---|---|
| gt | 1.1.0 | @gt |
| infer | 1.0.9 | @infer |
| MKinfer | 1.2 | @MKinfer |
| openintro | 2.5.0 | @openintro |

Footnotes

  1. https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/tidyr-pivoting.gif↩︎


License: CC BY-SA 2.0
