Inference for Correlation

Arvind V.

2022-11-25

Setting up R packages

library(tidyverse) # Sine Qua Non
library(mosaic) # Our bag of tricks
library(broom) # Tidying model outputs
library(crosstable) # tabulated summary stats
library(openintro) # datasets and methods
library(resampledata3) # datasets
library(mosaicData) # datasets
library(statsExpressions) # datasets and methods
library(ggstatsplot) # special stats plots
library(ggExtra) # marginal plots for ggplot2

# Non-CRAN Packages
# remotes::install_github("easystats/easystats")
library(easystats)

Plot Fonts and Theme

library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    # theme(panel.widths = unit(11, "cm"),
    #       panel.heights = unit(6.79, "cm")) + # Golden Ratio

    theme(
      plot.margin = margin_auto(t = 1, r = 2, b = 1, l = 1, unit = "cm"),
      plot.background = element_rect(
        fill = "bisque",
        colour = "black",
        linewidth = 1
      )
    ) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 10
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 8
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

Introduction

Correlation describes how one variable varies with another. One of the basic questions we ask of our data is: does some variable have a significant correlation with another? Does \(y\) vary with \(x\)? A correlation test is designed to answer exactly this question. The block diagram below depicts the statistical procedures available to test for the significance of the correlation score between two variables.

In this module we will explore the correlation coefficient and how to test for its significance. We will also see how to use the linear model method to perform correlation tests, and how to use the permutation test to do so without any assumptions.

Basic Definitions

Before we begin, let us recap a few basic definitions:

We have already encountered the variance of a variable:

\[ \begin{align*} var_x &= \frac{\sum_{i=1}^{n}(x_i - \mu_x)^2}{(n-1)}\\ where ~ \mu_x &= mean(x)\\ n &= sample\ size \end{align*} \]

The standard deviation is:

\[ \sigma_x = \sqrt{var_x} \]

The covariance of two variables is defined as:

\[ \begin{align} cov(x,y) &= \frac{\sum_{i = 1}^{n}(x_i - \mu_x)*(y_i - \mu_y)}{n-1}\\ &= \frac{\sum{x_i *y_i}}{n-1} - \frac{\sum{x_i *\mu_y}}{n-1} - \frac{\sum{y_i *\mu_x}}{n-1} + \frac{\sum{\mu_x *\mu_y}}{n-1}\\ &= \frac{\sum{x_i *y_i}}{n-1} - \frac{\sum{\mu_x *\mu_y}}{n-1}\\ \end{align} \]

Hence covariance is the expectation of the product minus the product of the expectations of the two variables.
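This identity is easy to verify numerically; a quick sketch with two simulated vectors (invented here just for the check):

set.seed(1)
x <- rnorm(100)
y <- rnorm(100)
n <- length(x)
cov(x, y) # built-in sample covariance
(sum(x * y) - n * mean(x) * mean(y)) / (n - 1) # the expanded formula above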

So, finally, the coefficient of correlation between two variables is defined as:

\[ \begin{align} correlation ~ r &= \frac{cov(x,y)}{\sigma_x * \sigma_y}\\ &= \frac{\sum_{i = 1}^{n}(x_i - \mu_x)*(y_i - \mu_y)}{(\sigma_x * \sigma_y)(n-1)}\\ &= \frac{\sum_{i = 1}^{n}\left(\frac{x_i - \mu_x}{\sigma_x}\right)*\left(\frac{y_i - \mu_y}{\sigma_y}\right)}{(n-1)}\\ \end{align} \tag{1}\]

which is the average of the product of the z-scores of the two variables.

Correlation uses z-scores!

Note that in both cases we are dealing with z-scores: a variable minus its mean, scaled by its standard deviation, \(\frac{x_i - \mu_x}{\sigma_x}\), which we have seen when dealing with the CLT and the Gaussian Distribution.
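We can also verify Equation 1 numerically; a quick sketch with simulated data (invented for the check):

set.seed(7)
x <- rnorm(50, mean = 10, sd = 3)
y <- 2 * x + rnorm(50, sd = 4)
cor(x, y) # built-in Pearson correlation
sum(scale(x) * scale(y)) / (length(x) - 1) # average product of z-scores, as in Equation 1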

Case Study #1: Galton’s famous dataset

How can we start, except by using the famous Galton dataset, now part of the mosaicData package?

Workflow: Read and Inspect the Data

data("Galton", package = "mosaicData")
Galton
skimr::skim(Galton)
Data summary

Name                    Galton
Number of rows          898
Number of columns       6
Column type frequency:
  factor                2
  numeric               4
Group variables         None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
family 0 1 FALSE 197 185: 15, 166: 11, 66: 11, 130: 10
sex 0 1 FALSE 2 M: 465, F: 433

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
father 0 1 69.23 2.47 62 68 69.0 71.0 78.5 ▁▅▇▂▁
mother 0 1 64.08 2.31 58 63 64.0 65.5 70.5 ▂▅▇▃▁
height 0 1 66.76 3.58 56 64 66.5 69.7 79.0 ▁▇▇▅▁
nkids 0 1 6.14 2.69 1 4 6.0 8.0 15.0 ▃▇▆▂▁

So there are several correlations we can explore here: Children’s height vs that of father or mother, based on sex. In essence we are replicating Francis Galton’s famous study.

Data Munging

Note that nkids, while coded as int, is actually a factor variable. Let us convert it to a factor:

Galton_modified <- Galton %>%
  mutate(nkids = as_factor(nkids))

Data Dictionary

Qualitative Variables

  • sex(fct): sex of the child
  • family(fct): an ID for each family
  • nkids(fct): Number of children in each family

Quantitative Variables

  • father(dbl): father’s height in inches
  • mother(dbl): mother’s height in inches
  • height(dbl): Child’s height in inches

Workflow: Experiment, and Research Questions

We can surmise that Galton measured the heights of fathers, mothers, and their (adult) children, and recorded the sex of each child. He was interested in knowing whether there is a correlation between the heights of parents and their children, and whether this correlation differs for sons and daughters. Hence the child's height is the response variable, and the father's/mother's heights are explanatory variables, as is the sex of the child.

Question 1

  1. Based on this sample, what can we say about the correlation between a son’s height and a father’s height in the population?

Question 2

  1. Based on this sample, what can we say about the correlation between a daughter’s height and a father’s height in the population?

Of course we can formulate more questions, but these are good for now! And since we are going to infer correlations by sex, let us split the dataset into two parts, one for the sons and one for the daughters, and quickly summarise them too:

Galton_sons <- Galton_modified %>%
  dplyr::filter(sex == "M") %>%
  rename("son" = height)
Galton_daughters <- Galton_modified %>%
  dplyr::filter(sex == "F") %>%
  rename("daughter" = height)
dim(Galton_sons)
dim(Galton_daughters)
Sons Data

[1] 465   6

Daughters Data

[1] 433   6

Galton_sons %>%
  summarize(across(
    .cols = c(son, father),
    .fns = list(mean = mean, sd = sd)
  ))
Galton_daughters %>%
  summarize(across(
    .cols = c(daughter, father),
    .fns = list(mean = mean, sd = sd)
  ))

Sons Summary Stats

Daughters Summary Stats

What did filtering do?

Why are the father means different for sons and daughters? When we split the dataset in two, filtering by the sex of the child also effectively filtered the heights of the fathers (and mothers): each subset retains one father-height entry per child, so the two subsets contain different collections of father heights. This is proper and desired; but think!
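We can verify this directly; a quick check with mosaic's favstats():

favstats(father ~ sex, data = Galton)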

Workflow: Visualization

Let us first quickly plot a graph that is relevant to each of the two research questions.

ggplot2::theme_set(new = theme_custom())

Galton_sons %>%
  gf_point(son ~ father) %>%
  gf_lm() %>%
  gf_labs(
    x = "Father's Height", y = "Son's Height",
    title = "Heights: Sons vs Fathers",
    subtitle = "Galton dataset"
  )
##
Galton_daughters %>%
  gf_point(daughter ~ father) %>%
  gf_lm() %>%
  gf_labs(
    x = "Father's Height", y = "Daughter's Height",
    title = "Heights: Daughters vs Fathers",
    subtitle = "Galton dataset"
  )
(a) Sons
(b) Daughters
Figure 1: Heights of Sons and Daughters vs Fathers

We might even plot the overall heights together and colour by sex of the child:

ggplot2::theme_set(new = theme_custom())

Galton %>%
  gf_point(height ~ father,
    group = ~sex, colour = ~sex
  ) %>%
  gf_lm() %>%
  gf_refine(scale_color_brewer(palette = "Set1")) %>%
  gf_labs(
    x = "Father's Height", y = "Children's Height",
    title = "Heights: Children vs Fathers",
    subtitle = "Galton dataset"
  )

So daughters are shorter than sons, generally speaking, and both sets of heights seem related to that of the father.
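A quick numerical check of these observed correlations, using mosaic's formula interface for cor():

cor(son ~ father, data = Galton_sons)
cor(daughter ~ father, data = Galton_daughters)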

Workflow: Assumptions

For the classical correlation tests, we need the variables to be normally distributed. As before, we check this with shapiro.test:

shapiro.test(Galton_sons$father)
shapiro.test(Galton_sons$son)
##
shapiro.test(Galton_daughters$father)
shapiro.test(Galton_daughters$daughter)

    Shapiro-Wilk normality test

data:  Galton_sons$father
W = 0.98529, p-value = 0.0001191

    Shapiro-Wilk normality test

data:  Galton_sons$son
W = 0.99135, p-value = 0.008133

    Shapiro-Wilk normality test

data:  Galton_daughters$father
W = 0.98438, p-value = 0.0001297

    Shapiro-Wilk normality test

data:  Galton_daughters$daughter
W = 0.99113, p-value = 0.01071

Let us also check the densities and quantile (QQ) plots of the heights in the dataset:

ggplot2::theme_set(new = theme_custom())

Galton %>%
  group_by(sex) %>%
  gf_density(~height,
    group = ~sex,
    fill = ~sex
  ) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_refine(scale_fill_brewer(palette = "Set1")) %>%
  gf_facet_grid(vars(sex)) %>%
  gf_labs(title = "Facetted Density Plots") %>%
  gf_theme(legend.position = "none") # Think!
##
Galton %>%
  group_by(sex) %>%
  gf_qq(~height,
    group = ~sex,
    colour = ~sex, size = 0.5
  ) %>%
  gf_qqline(colour = "black") %>%
  gf_refine(scale_color_brewer(palette = "Set1")) %>%
  gf_facet_grid(vars(sex)) %>%
  gf_labs(
    title = "Facetted QQ Plots",
    x = "Theoretical quartiles",
    y = "Actual Data"
  ) %>%
  gf_theme(legend.position = "none") # Think!

and the father’s heights:

ggplot2::theme_set(new = theme_custom())

Galton %>%
  group_by(sex) %>%
  gf_density(~father,
    group = ~sex, # no this is not weird
    fill = ~sex
  ) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_refine(scale_fill_brewer(name = "Sex of Child", palette = "Set1")) %>%
  gf_facet_grid(vars(sex)) %>%
  gf_labs(
    title = "Fathers: Facetted Density Plots",
    subtitle = "By Sex of Child"
  ) %>%
  gf_theme(legend.position = "none") # Think!
Galton %>%
  group_by(sex) %>%
  gf_qq(~father,
    group = ~sex, # no this is not weird
    colour = ~sex, size = 0.5
  ) %>%
  gf_qqline(colour = "black") %>%
  gf_facet_grid(vars(sex)) %>%
  gf_refine(scale_colour_brewer(name = "Sex of Child", palette = "Set1")) %>%
  gf_labs(
    title = "Fathers Heights: Facetted QQ Plots",
    subtitle = "By Sex of Child",
    x = "Theoretical quartiles",
    y = "Actual Data"
  ) %>%
  gf_theme(legend.position = "none") # Think!

The shapiro.test informs us that none of the height variables are normally distributed, though visually there seems little to complain about. Hmmm… (With samples this large, shapiro.test flags even small departures from normality.)

Dads are weird anyway, so we must not expect father heights to be normally distributed.

Workflow: Inference

Let us now see how Correlation Tests can be performed based on this dataset, to infer patterns in the population from which this dataset/sample was drawn.

We will go with classical tests first, and then set up a permutation test that does not need any assumptions.
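Here is a minimal sketch of both approaches for Question 1, assuming the mosaic shuffling idiom used elsewhere in this module; the statistic is the Pearson correlation between sons' and fathers' heights:

# Classical Pearson correlation test (assumes normality)
cor.test(~ son + father, data = Galton_sons) %>% broom::tidy()
## Permutation test: no distributional assumptions
set.seed(2022)
obs_corr <- cor(son ~ father, data = Galton_sons) # observed statistic
null_dist <- do(4999) * cor(son ~ shuffle(father), data = Galton_sons)
# Two-sided permutation p-value: how often is a shuffled correlation as extreme?
prop(~ abs(cor) >= abs(obs_corr), data = null_dist)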

“Signed Rank” Values

Most statistical tests use the actual values of the data variables. However, some non-parametric statistical tests use the data in rank-transformed form. (In some cases the signed rank of the data values is used instead of the data itself.)

Signed Rank is calculated as follows:

  1. Take the absolute value of each observation in the sample.

  2. Rank these absolute values in increasing order: the smallest gets rank = 1, and so on. This gives us ranked data.

  3. Give each rank the sign of the original observation (+ or -). This gives us signed-rank data, as implemented (and demonstrated) below.

signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}
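A toy vector shows what the function returns:

dat <- c(-2.5, 0.3, -0.1, 1.7, 4.2)
rank(abs(dat)) # ranks of the absolute values: 4 2 1 3 5
signed_rank(dat) # signs restored: -4 2 -1 3 5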

Plotting Original and Signed Rank Data

Let us see how this might work by comparing data with its signed-rank version, via a quick set of plots.
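The original plotting code is not shown here; as a minimal sketch, assume a simulated predictor x and response y1 (invented for illustration; the original compares several simulated variables, and the true slope here is set to 0.25 to match the slopes quoted below):

set.seed(42)
sim <- tibble(
  x  = rnorm(500),
  y1 = 0.25 * x + rnorm(500, sd = sqrt(1 - 0.25^2)) # sd(y1) ~ 1, like sd(x)
) %>%
  mutate(x_sr = signed_rank(x), y1_sr = signed_rank(y1))
##
sim %>%
  gf_point(y1 ~ x) %>%
  gf_lm() %>%
  gf_labs(title = "Original Data")
sim %>%
  gf_point(y1_sr ~ x_sr) %>%
  gf_lm() %>%
  gf_labs(title = "Signed-Rank Data")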

So the means of the ranks of the three separate variables seem to be in the same order as the means of the data variables themselves.

How about associations between variables? Do the ranks reflect the associations present in the data?

The slopes are almost identical, \(0.25\) for both the original data and the ranked data for \(y1 \sim x\). So ranked, and even signed-rank, data could stand in for the original values; and if this works even when the LINE assumptions are not satisfied, that would be nice!

How does Sign-Rank data work?

Rank-transformed data works because, whatever the distribution of the original observations, their ranks are always just the integers \(1 \ldots n\). Statistics computed from ranks therefore have known null distributions that do not depend on normality, while the ranks still preserve the ordering, and hence any monotonic association, in the data.

Spearman correlation = Pearson correlation computed on the ranks of the data observations. Let's check how this holds for our x and y1 data:

So the Linear Model for the Ranked Data would be:

\[ \begin{aligned} rank(y) &= \beta_0 + \beta_1 \times rank(x)\\ H_0: Null\ Hypothesis &\Rightarrow \beta_1 = 0\\ H_a: Alternate\ Hypothesis &\Rightarrow \beta_1 \ne 0\\ \end{aligned} \]

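A minimal sketch of that check, using the simulated x and y1 from above:

# Spearman correlation of the raw data...
cor(sim$x, sim$y1, method = "spearman")
# ...equals the Pearson correlation of the ranks...
cor(rank(sim$x), rank(sim$y1))
# ...and matches the slope in the rank-rank linear model
lm(rank(y1) ~ rank(x), data = sim) %>% broom::tidy()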

Notes:

  1. When ranks are used, the slope of the linear model (\(\beta_1\)) has the same value as the Spearman correlation coefficient ( \(\rho\) ).

  2. Note that the slope from the linear model now has an intuitive interpretation: the number of ranks \(y\) changes for each unit change in the rank of \(x\). (Ranks are "independent" of the standard deviation.)

Example

We examine the cars93 data from the openintro package, where the numeric variables of interest are weight and price.

ggplot2::theme_set(new = theme_custom())

cars93 %>%
  ggplot(aes(weight, price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, lty = 2) +
  labs(title = "Car Weight and Car Price have a nonlinear relationship") +
  theme_classic()

Let us try a Spearman correlation score for these variables, since the data are not linearly related and the variance of price is not constant over weight:

ggplot2::theme_set(new = theme_custom())

cor.test(cars93$price, cars93$weight, method = "spearman") %>% broom::tidy()
# Using linear Model
lm(rank(price) ~ rank(weight), data = cars93) %>% summary()

Call:
lm(formula = rank(price) ~ rank(weight), data = cars93)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.0676  -3.0135   0.7815   3.6926  20.4099 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.22074    2.05894   1.564    0.124    
rank(weight)  0.88288    0.06514  13.554   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.46 on 52 degrees of freedom
Multiple R-squared:  0.7794,    Adjusted R-squared:  0.7751 
F-statistic: 183.7 on 1 and 52 DF,  p-value: < 2.2e-16
# Stats Plot
ggstatsplot::ggscatterstats(
  data = cars93, x = weight,
  y = price,
  type = "nonparametric",
  title = "Cars93: Weight vs Price",
  subtitle = "Spearman Correlation"
)

We see that, using the ranks of both variables, we obtain a Spearman's \(\rho \approx 0.88\) with a very small p-value. Hence we are able to reject the NULL hypothesis and state that there is a (monotonic) relationship between these two variables. As promised, the slope of the rank-rank linear model (\(0.883\)) essentially reproduces the Spearman correlation coefficient.

# Other ways using other packages
mosaic::cor_test(gpa ~ study_hours, data = gpa_study_hours) %>%
  broom::tidy() %>%
  select(estimate, p.value, conf.low, conf.high)
statsExpressions::corr_test(
  data = gpa_study_hours,
  x = study_hours,
  y = gpa
)

Wait, But Why?

Correlation tests are useful to understand the relationship between two variables, but they do not imply causation. A high correlation does not mean that one variable causes the other to change. It is essential to consider the context and other factors that may influence the relationship.

Correlations are also an important quantity to evaluate in linear regression.

Conclusion

Correlation tests are a powerful way to understand the relationship between two variables. They can be performed using classical methods like Pearson and Spearman correlation, or using more robust methods like permutation tests. The linear model approach provides a deeper understanding of the relationship, especially when the assumptions of normality and homoscedasticity are met.

Your Turn

  1. Try the datasets in the infer package. Use data(package = "infer") in your Console to list the datasets it contains. Then simply type the name of a dataset in a Quarto chunk (e.g. gss) to read it.

  2. Same with the resampledata and resampledata3 packages.

References

  1. Common statistical tests are linear models (or: how to teach stats) by Jonas Kristoffer Lindeløv
    CheatSheet
  2. Common statistical tests are linear models: a work through by Steve Doogue
  3. Jeffrey Walker “Elements of Statistical Modeling for Experimental Biology”
  4. Diez, David M & Barr, Christopher D & Çetinkaya-Rundel, Mine: OpenIntro Statistics
  5. Modern Statistics with R: From wrangling and exploring data to inference and predictive modelling by Måns Thulin
  6. Jeffrey Walker “A linear-model-can-be-fit-to-data-with-continuous-discrete-or-categorical-x-variables”
  7. https://grok.com/share/c2hhcmQtNA%3D%3D_5a4873eb-9c38-4ee2-a6dc-de1505c415d5

Examples

  1. https://stats.libretexts.org/Courses/Citrus_College/Statistics_C1000%3A_Introduction_to_Statistics/10%3A_Correlation_and_Linear_Regression/10.04%3A_Hypothesis_Test_for_a_Correlation_Using_t-Test
Package Version Citation
easystats 0.7.5 Lüdecke et al. (2022)
ggExtra 0.11.0 Attali and Baker (2025)
ggstatsplot 0.13.3 Patil (2021b)
openintro 2.5.0 Çetinkaya-Rundel et al. (2024)
resampledata3 1.0 Chihara and Hesterberg (2022)
statsExpressions 1.7.1 Patil (2021a)
Attali, Dean, and Christopher Baker. 2025. "ggExtra: Add Marginal Histograms to 'ggplot2', and More 'ggplot2' Enhancements." https://doi.org/10.32614/CRAN.package.ggExtra.
Çetinkaya-Rundel, Mine, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno, and Christopher Barr. 2024. openintro: Datasets and Supplemental Functions from OpenIntro Textbooks and Labs. https://doi.org/10.32614/CRAN.package.openintro.
Chihara, Laura, and Tim Hesterberg. 2022. Resampledata3: Data Sets for Mathematical Statistics with Resampling and R (3rd Ed). https://doi.org/10.32614/CRAN.package.resampledata3.
Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Brenton M. Wiernik, Etienne Bacher, Rémi Thériault, and Dominique Makowski. 2022. "easystats: Framework for Easy Statistical Modeling, Visualization, and Reporting." CRAN. https://doi.org/10.32614/CRAN.package.easystats.
Patil, Indrajeet. 2021a. "statsExpressions: R Package for Tidy Dataframes and Expressions with Statistical Details." Journal of Open Source Software 6 (61): 3236. https://doi.org/10.21105/joss.03236.
———. 2021b. "Visualizations with Statistical Details: The ggstatsplot Approach." Journal of Open Source Software 6 (61): 3167. https://doi.org/10.21105/joss.03167.