Inference for Correlation

1 Setting up R packages

library(tidyverse) # Sine Qua Non
library(mosaic) # Our bag of tricks
library(ggformula) # Our graphics package
library(broom) # Tidying model outputs
library(crosstable) # tabulated summary stats
library(openintro) # datasets and methods
library(resampledata3) # datasets
library(mosaicData) # datasets
library(statsExpressions) # datasets and methods
library(ggstatsplot) # special stats plots
library(ggExtra)
# Non-CRAN Packages
# remotes::install_github("easystats/easystats")
library(easystats)
Plot Fonts and Theme
Show the Code
library(systemfonts)
library(showtext)
library(ggrepel)
library(marquee)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
family = "Alegreya",
regular = "../../../../../../fonts/Alegreya-Regular.ttf",
bold = "../../../../../../fonts/Alegreya-Bold.ttf",
italic = "../../../../../../fonts/Alegreya-Italic.ttf",
bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)
sysfonts::font_add(
family = "Roboto Condensed",
regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
theme_bw(base_size = 10) +
theme_sub_axis(
title = element_text(
family = "Roboto Condensed",
size = 8
),
text = element_text(
family = "Roboto Condensed",
size = 6
)
) +
theme_sub_legend(
text = element_text(
family = "Roboto Condensed",
size = 6
),
title = element_text(
family = "Alegreya",
size = 8
)
) +
theme_sub_plot(
title = element_text(
family = "Alegreya",
size = 14, face = "bold"
),
title.position = "plot",
subtitle = element_text(
family = "Alegreya",
size = 10
),
caption = element_text(
family = "Alegreya",
size = 6
),
caption.position = "plot"
)
}
## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
family = "Roboto Condensed",
face = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
family = "Roboto Condensed",
face = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "marquee", new = list(
family = "Roboto Condensed",
face = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
family = "Roboto Condensed",
face = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
family = "Roboto Condensed",
face = "plain",
size = 3.5,
color = "#2b2b2b"
))
## Set the theme
ggplot2::theme_set(new = theme_custom())
## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)
## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")
2 Introduction
Correlation describes how one variable varies with another. One of the basic questions we would ask of our data is: does some variable have a significant correlation with another? Does \(y\) vary with \(x\)? A Correlation Test is designed to answer exactly this question. The block diagram below depicts the statistical procedures available to test for the significance of correlation scores between two variables.
In this module we will explore the correlation coefficient and how to test for its significance. We will also see how to use the linear model method to perform correlation tests, and how to use the permutation test to do so without any assumptions.
2.1 Basic Definitions
Before we begin, let us recap a few basic definitions:
We have already encountered the variance
of a variable:
\[ \begin{align*} var_x &= \frac{\sum_{i=1}^{n}(x_i - \mu_x)^2}{(n-1)}\\ where ~ \mu_x &= mean(x)\\ n &= sample\ size \end{align*} \] The standard deviation is:
\[ \sigma_x = \sqrt{var_x}\\ \] The covariance of two variables is defined as:
\[ \begin{align} cov(x,y) &= \frac{\sum_{i = 1}^{n}(x_i - \mu_x)*(y_i - \mu_y)}{n-1}\\ &= \frac{\sum{x_i *y_i}}{n-1} - \frac{\sum{x_i *\mu_y}}{n-1} - \frac{\sum{y_i *\mu_x}}{n-1} + \frac{\sum{\mu_x *\mu_y}}{n-1}\\ &= \frac{\sum{x_i *y_i}}{n-1} - \frac{\sum{\mu_x *\mu_y}}{n-1}\\ \end{align} \]
Hence covariance is the expectation of the product minus the product of the expectations of the two variables.
So, finally, the coefficient of correlation between two variables is defined as:
\[ \begin{align} correlation ~ r &= \frac{cov(x,y)}{\sigma_x * \sigma_y}\\ &= \frac{\sum_{i = 1}^{n}(x_i - \mu_x)*(y_i - \mu_y)}{(\sigma_x * \sigma_y)(n-1)}\\ &= \frac{\sum_{i = 1}^{n}\left(\frac{x_i - \mu_x}{\sigma_x}\right)*\left(\frac{y_i - \mu_y}{\sigma_y}\right)}{(n-1)}\\ \end{align} \tag{1}\]
which is the average of the product of the z-scores of the two variables.
Note that in both cases we are dealing with z-scores: the variable minus its mean, divided by its standard deviation, \(\frac{x_i - \mu_x}{\sigma_x}\), which we have seen when dealing with the CLT and the Gaussian Distribution.
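As a quick numerical check of Equation 1, we can compute the average product of z-scores by hand and compare it with R's built-in cor(); a small sketch using two made-up vectors:

x <- c(61, 64, 67, 70, 73) # made-up heights
y <- c(120, 136, 150, 167, 185) # made-up weights
z_x <- (x - mean(x)) / sd(x) # z-scores of x
z_y <- (y - mean(y)) / sd(y) # z-scores of y
sum(z_x * z_y) / (length(x) - 1) # average product of z-scores
cor(x, y) # identical result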
3 Case Study #1: Galton’s famous dataset
How can we start, except by using the famous Galton dataset, now part of the mosaicData package?
3.1 Workflow: Read and Inspect the Data
data("Galton", package = "mosaicData")
Galton
skimr::skim(Galton)
| Data summary | |
|---|---|
| Name | Galton |
| Number of rows | 898 |
| Number of columns | 6 |
| Column type frequency: factor | 2 |
| Column type frequency: numeric | 4 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| family | 0 | 1 | FALSE | 197 | 185: 15, 166: 11, 66: 11, 130: 10 |
| sex | 0 | 1 | FALSE | 2 | M: 465, F: 433 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| father | 0 | 1 | 69.23 | 2.47 | 62 | 68 | 69.0 | 71.0 | 78.5 | ▁▅▇▂▁ |
| mother | 0 | 1 | 64.08 | 2.31 | 58 | 63 | 64.0 | 65.5 | 70.5 | ▂▅▇▃▁ |
| height | 0 | 1 | 66.76 | 3.58 | 56 | 64 | 66.5 | 69.7 | 79.0 | ▁▇▇▅▁ |
| nkids | 0 | 1 | 6.14 | 2.69 | 1 | 4 | 6.0 | 8.0 | 15.0 | ▃▇▆▂▁ |
So there are several correlations we can explore here: children's height vs that of father or mother, based on sex. In essence we are replicating Francis Galton's famous study.
3.2 Data Munging
Note that nkids, while coded as int, is actually a factor variable. Let us convert it to a factor:
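The conversion chunk itself is folded in the original; a minimal sketch using dplyr and forcats could be:

Galton <- Galton %>%
  mutate(nkids = as_factor(nkids)) # family size treated as a label, not a number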
3.3 Data Dictionary

- sex (fct): sex of the child
- family (int): an ID for each family
- nkids (fct): number of children in each family
- father (dbl): father's height in inches
- mother (dbl): mother's height in inches
- height (dbl): child's height in inches
3.4 Workflow: Experiment, and Research Questions
We can say that Galton may have measured the heights of fathers/mothers and their children, and recorded the sex of the child. He may have been interested in knowing if there is a correlation between the heights of the parents and their children, and if this correlation is different for sons and daughters. Hence the children's height is the response variable, and the father's/mother's heights are explanatory variables, as is the sex of the child.
- Based on this sample, what can we say about the correlation between a son’s height and a father’s height in the population?
- Based on this sample, what can we say about the correlation between a daughter’s height and a father’s height in the population?
Of course we can formulate more questions, but these are good for now! And since we are going to infer correlations by sex, let us split the dataset into two parts, one for the sons and one for the daughters, and quickly summarise them too:
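The splitting code is folded in the original; a minimal sketch, assuming the child's height column is renamed to son / daughter (the names used in the plots and tests that follow), might be:

Galton_sons <- Galton %>%
  filter(sex == "M") %>%
  rename(son = height)
Galton_daughters <- Galton %>%
  filter(sex == "F") %>%
  rename(daughter = height)
##
dim(Galton_sons)
dim(Galton_daughters)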
[1] 465 6
Sons Data
[1] 433 6
Daughters Data
Sons Summary Stats
Daughters Summary Stats
Why are the father means different for sons and daughters? When we filtered the dataset into two, the filtering by sex of the child also effectively filtered the heights of the father (and mother). This is proper and desired; but think!
3.5 Workflow: Visualization
Let us first quickly plot a graph that is relevant to each of the two research questions.
Show the Code
ggplot2::theme_set(new = theme_custom())
Galton_sons %>%
gf_point(son ~ father) %>%
gf_lm() %>%
gf_labs(
x = "Father's Height", y = "Son's Height",
title = "Heights: Sons vs Fathers",
subtitle = "Galton dataset"
)
##
Galton_daughters %>%
gf_point(daughter ~ father) %>%
gf_lm() %>%
gf_labs(
x = "Father's Height", y = "Daughter's Height",
title = "Heights: Daughters vs Fathers",
subtitle = "Galton dataset"
)
We might even plot the overall heights together and colour by sex of the child:
ggplot2::theme_set(new = theme_custom())
Galton %>%
gf_point(height ~ father,
group = ~sex, colour = ~sex
) %>%
gf_lm() %>%
gf_refine(scale_color_brewer(palette = "Set1")) %>%
gf_labs(
x = "Father's Height", y = "Children's Height",
title = "Heights: Children vs Fathers",
subtitle = "Galton dataset"
)
So daughters are shorter than sons, generally speaking, and both sets of heights seem related to that of the father.
3.6 Workflow: Assumptions
For the classical correlation tests, we need the variables to be normally distributed. As before, we check this with shapiro.test():
shapiro.test(Galton_sons$father)
shapiro.test(Galton_sons$son)
##
shapiro.test(Galton_daughters$father)
shapiro.test(Galton_daughters$daughter)
Shapiro-Wilk normality test
data: Galton_sons$father
W = 0.98529, p-value = 0.0001191
Shapiro-Wilk normality test
data: Galton_sons$son
W = 0.99135, p-value = 0.008133
Shapiro-Wilk normality test
data: Galton_daughters$father
W = 0.98438, p-value = 0.0001297
Shapiro-Wilk normality test
data: Galton_daughters$daughter
W = 0.99113, p-value = 0.01071
Let us also check the densities and quantile (Q-Q) plots of the heights in the dataset:
Show the Code
ggplot2::theme_set(new = theme_custom())
Galton %>%
group_by(sex) %>%
gf_density(~height,
group = ~sex,
fill = ~sex
) %>%
gf_fitdistr(dist = "dnorm") %>%
gf_refine(scale_fill_brewer(palette = "Set1")) %>%
gf_facet_grid(vars(sex)) %>%
gf_labs(title = "Facetted Density Plots") %>%
gf_theme(legend.position = "none") # Think!
##
Galton %>%
group_by(sex) %>%
gf_qq(~height,
group = ~sex,
colour = ~sex, size = 0.5
) %>%
gf_qqline(colour = "black") %>%
gf_refine(scale_color_brewer(palette = "Set1")) %>%
gf_facet_grid(vars(sex)) %>%
gf_labs(
title = "Facetted QQ Plots",
x = "Theoretical quartiles",
y = "Actual Data"
) %>%
gf_theme(legend.position = "none") # Think!
and the father's heights:
Show the Code
ggplot2::theme_set(new = theme_custom())
Galton %>%
group_by(sex) %>%
gf_density(~father,
group = ~sex, # no this is not weird
fill = ~sex
) %>%
gf_fitdistr(dist = "dnorm") %>%
gf_refine(scale_fill_brewer(name = "Sex of Child", palette = "Set1")) %>%
gf_facet_grid(vars(sex)) %>%
gf_labs(
title = "Fathers: Facetted Density Plots",
subtitle = "By Sex of Child"
) %>%
gf_theme(legend.position = "none") # Think!
Galton %>%
group_by(sex) %>%
gf_qq(~father,
group = ~sex, # no this is not weird
colour = ~sex, size = 0.5
) %>%
gf_qqline(colour = "black") %>%
gf_facet_grid(vars(sex)) %>%
gf_refine(scale_colour_brewer(name = "Sex of Child", palette = "Set1")) %>%
gf_labs(
title = "Fathers Heights: Facetted QQ Plots",
subtitle = "By Sex of Child",
x = "Theoretical quartiles",
y = "Actual Data"
) %>%
gf_theme(legend.position = "none") # Think!
The shapiro.test informs us that the child-related height variables are not normally distributed, though visually there seems nothing much to complain about. Hmmm… Dads are weird anyway, so we must not expect father heights to be normally distributed.
3.7 Workflow: Inference
Let us now see how Correlation Tests can be performed based on this dataset, to infer patterns in the population from which this dataset/sample was drawn.
We will go with classical tests first, and then set up a permutation test that does not need any assumptions.
We perform the Pearson correlation test first, noting that since the data are not normal we cannot fully rely on it. We therefore also use a non-parametric test, the Spearman correlation.
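The test chunks are folded in the original; a minimal sketch, assuming the htest objects are tidied with broom so that columns such as estimate, parameter, conf.low and conf.high are available later, might be:

cor_son_pearson <- cor.test(~ son + father,
  data = Galton_sons, method = "pearson"
) %>%
  broom::tidy()
cor_son_spearman <- cor.test(~ son + father,
  data = Galton_sons, method = "spearman"
) %>%
  broom::tidy()
##
cor_daughter_pearson <- cor.test(~ daughter + father,
  data = Galton_daughters, method = "pearson"
) %>%
  broom::tidy()
cor_daughter_spearman <- cor.test(~ daughter + father,
  data = Galton_daughters, method = "spearman"
) %>%
  broom::tidy()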
Both tests state that the correlation between son and father is significant.
Again, both tests state that the correlation between daughter and father is significant.
What is happening under the hood in cor.test?
Given that the correlation coefficient \(r\) is a measure of the linear relationship between two variables, we can test its significance using a t-test. The formula for the t-value in a correlation test is derived from the relationship between the correlation coefficient and the t-distribution.
The formula for the t-value is given by: \[t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \tag{2}\] where:
- \(t\) is the t-statistic,
- \(r\) is the Pearson correlation coefficient,
- \(n\) is the number of paired observations (sample size).
- The degrees of freedom for this t-test is \(df = n - 2\).
The derivation of this t-statistic stems from the fact that the correlation coefficient can be expressed in terms of the slope of the regression line when one variable is regressed on the other. The t-test essentially tests whether the slope of this regression line is significantly different from zero, which would indicate a significant linear relationship between the two variables.
OK, if you like, you can stop here! But for you intrepid, beamish people who possess vorpal swords, here come the Dragons!!
In Linear Regression, we saw that the F-statistic is a ratio of variances, and that it follows an F-distribution. How does this relate to correlation?

- In regression we have a target variable (sons' heights) and a predictor variable (fathers' heights), just as in our present study of correlation.
- We look at how much one variable explains, or reduces, the variance of the other: i.e. how much the variance of the target variable (sons' heights) is reduced by the fact that we know the value(s) of the predictor variable (fathers' heights).
- Our measure of how well this reduction works is a ratio of variances, also denoted \(r^2\), i.e. the square of the correlation coefficient.
- We take the variance of the target variable, and the variance of the target variable after we have accounted for the predictor variable.
- This ratio of variances follows an F-distribution with appropriate degrees of freedom, and gives us our F-value, computed from our data.
  - We compare our computed F-value with the critical value F-crit for a probability of error of \(5\%\), given by the F-distribution with appropriate degrees of freedom.
  - If our F-value is well above F-crit, we state that there is a very low probability (i.e. p-value) that this reduction happened simply by chance, and accept the hypothesis that there is a significant correlation between the two variables.
- We relate the idea of Regression to Correlation by noting that the t-value for correlation must simply be the square root of the F-value for regression, since the F-distribution is for \(r^2\) and the t-distribution is for \(r\).
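A quick numerical check of this \(t^2 = F\) relationship on the sons' data (a sketch, assuming the Galton_sons split defined earlier):

lm_son <- lm(son ~ father, data = Galton_sons)
F_value <- summary(lm_son)$fstatistic["value"] # F for the regression
t_value <- cor.test(~ son + father, data = Galton_sons)$statistic # t for the correlation
c(F = F_value, t_squared = t_value^2) # the two agree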
Why does the variance-ratio have a distribution??? The two variances appear to be single numbers !!! Is there anything that is not random in statistics?
- Remember, we treat our data as a sample from a population, in order to estimate the regression slope for the population.
- The sample is random, and hence the variance of the sample is also random.
- If we took another sample, we would get a different variance.
- So the variance of a sample is a random variable, and hence the ratio-of-variances also has a distribution. Phew!
Why does the ratio-of-variances have an F-distribution?
- In our case, our residuals (deviations from the means) are assumed to be normal.
- The variance calculation squares these normally-distributed residuals, leading to a chi-square distribution for the individual variances.
- The ratio of these two independent chi-squared variables, each divided by their respective degrees of freedom, follows an F-distribution.
- This is a fundamental result in statistics that underpins the use of the F-test in statistical analysis.
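A small simulation makes this concrete (a sketch; the degrees of freedom 1 and 463 are chosen to match the sons' regression):

set.seed(42)
df1 <- 1
df2 <- 463
# Ratio of two independent chi-square variables, each divided by its df
ratio <- (rchisq(10000, df = df1) / df1) / (rchisq(10000, df = df2) / df2)
# Simulated quantiles sit close to the theoretical F quantiles
quantile(ratio, probs = c(0.50, 0.95, 0.99))
qf(p = c(0.50, 0.95, 0.99), df1 = df1, df2 = df2)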
To derive the formula for the t-value in a correlation test, starting with a ratio of variances, we need to focus on the t-test for the significance of the Pearson correlation coefficient. The t-value assesses whether the observed correlation coefficient \(r\) is significantly different from zero. Let's proceed step-by-step, connecting the t-test to variances and ensuring a clear derivation.
Step 1: Understanding the Pearson Correlation Coefficient
The Pearson correlation coefficient \(r\) measures the linear relationship between two variables \(X\) and \(Y\). It is defined as:
\[r = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y}) / (n-1)}{\sqrt{\left( \sum (x_i - \bar{x})^2 / (n-1) \right) \left( \sum (y_i - \bar{y})^2 / (n-1) \right)}}\] where:
\(\text{Cov}(X, Y)\) is the covariance of \(X\) and \(Y\), \(\text{Var}(X)\) and \(\text{Var}(Y)\) are the variances of \(X\) and \(Y\), \(x_i\) and \(y_i\) are the data points, \(\bar{x}\) and \(\bar{y}\) are the means, \(n\) is the sample size.
Simplifying, we get: \[r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}\]
This formula shows \(r\) as a standardized measure of covariance relative to the product of standard deviations (square roots of variances).
Step 2: Hypothesis Testing for Correlation
To test whether the correlation is significantly different from zero, we use a t-test. The null hypothesis (\(H_0\)) is that the population correlation coefficient \(\rho = 0\), and the alternative hypothesis (\(H_a\)) is \(\rho \neq 0\) (for a two-tailed test). The t-statistic for testing the significance of \(r\) is commonly given as:
\[t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \tag{3}\]
This t-statistic follows a t-distribution with \(n - 2\) degrees of freedom under the null hypothesis. Our goal is to derive this formula, starting from a perspective involving variances.
Step 3: Connecting to Variances via Linear Regression
The t-test for the correlation coefficient is closely related to the t-test for the slope of a linear regression model. Suppose we regress \(Y\) on \(X\): \[Y = \beta_0 + \beta_1 X + \epsilon\] The slope \(\beta_1\) estimates the change in \(Y\) per unit change in \(X\). The sample slope \(b_1\) is: \[b_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\] Notice the relationship between \(b_1\) and \(r\). Since: \[r = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}\]
We can express the covariance as: \[\text{Cov}(X, Y) = r \sqrt{\text{Var}(X) \text{Var}(Y)}\] So: \[ b_1 = \frac{r \sqrt{\text{Var}(X) \text{Var}(Y)}}{\text{Var}(X)} = r \sqrt{\frac{\text{Var}(Y)}{\text{Var}(X)}} = r \frac{s_y}{s_x} \tag{4}\]
where \(s_x = \sqrt{\text{Var}(X)} = \sqrt{\sum (x_i - \bar{x})^2 / (n-1)}\) and \(s_y = \sqrt{\text{Var}(Y)}\) are the sample standard deviations.
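Equation 4 is easy to verify numerically on the sons' data (a sketch, assuming the Galton_sons split defined earlier):

b1 <- coef(lm(son ~ father, data = Galton_sons))["father"] # regression slope
r <- cor(Galton_sons$son, Galton_sons$father) # correlation coefficient
all.equal(
  unname(b1),
  r * sd(Galton_sons$son) / sd(Galton_sons$father)
) # slope = r * (s_y / s_x)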
Step 4: t-Test for the Regression Slope
To test \(H_0: \beta_1 = 0\), we use the t-statistic for the slope:
\[t = \frac{b_1}{\text{SE}(b_1)}\]
where \(\text{SE}(b_1)\) is the standard error of the slope. The variance of the slope estimate is derived from the regression model. Assuming the errors \(\epsilon\) are normally distributed with variance \(\sigma^2\), the variance of \(b_1\) is:
\[\text{Var}(b_1) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}\] The standard error is: \[\text{SE}(b_1) = \sqrt{\frac{\sigma^2}{\sum (x_i - \bar{x})^2}}\] Since \(\sigma^2\) is unknown, we estimate it with the residual variance \(s^2\): \[s^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2}\] where \(\hat{y}_i = \bar{y} + b_1 (x_i - \bar{x})\) are the fitted values, and \(n - 2\) accounts for the degrees of freedom (two parameters estimated: \(\beta_0\) and \(\beta_1\)). Thus: \[\text{SE}(b_1) = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}\] So the t-statistic is:
\[t = \frac{b_1 \sqrt{\sum (x_i - \bar{x})^2}}{s}\]
Step 5: Relating the Residual Variance to \(r\)
The residual sum of squares is: \[\sum (y_i - \hat{y}_i)^2 = \sum (y_i - \bar{y} - b_1 (x_i - \bar{x}))^2\]
To connect this to \(r\), consider the proportion of variance explained by the regression. The coefficient of determination \(r^2\) (for simple linear regression, this is the square of the correlation coefficient) is: \[r^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}\] where:
- \(\text{SSR} = \sum (\hat{y}_i - \bar{y})^2\) (sum of squares due to regression),
- \(\text{SSE} = \sum (y_i - \hat{y}_i)^2\) (sum of squares of errors),
- \(\text{SST} = \sum (y_i - \bar{y})^2\) (total sum of squares).
Since \(\hat{y}_i - \bar{y} = b_1 (x_i - \bar{x})\), we have:
\[\text{SSR} = \sum (b_1 (x_i - \bar{x}))^2 = b_1^2 \sum (x_i - \bar{x})^2\]
Substitute \(b_1 = r \frac{s_y}{s_x}\):
\[b_1^2 = r^2 \frac{s_y^2}{s_x^2} = r^2 \frac{\sum (y_i - \bar{y})^2 / (n-1)}{\sum (x_i - \bar{x})^2 / (n-1)} = r^2 \frac{\sum (y_i - \bar{y})^2}{\sum (x_i - \bar{x})^2}\]
So: \[\text{SSR} = r^2 \frac{\sum (y_i - \bar{y})^2}{\sum (x_i - \bar{x})^2} \cdot \sum (x_i - \bar{x})^2 = r^2 \sum (y_i - \bar{y})^2\] Thus: \[r^2 = \frac{\text{SSR}}{\text{SST}} = \frac{r^2 \sum (y_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = r^2\]
This confirms consistency. Now, the residual sum of squares is: \[\text{SSE} = \text{SST} (1 - r^2) = (1 - r^2) \sum (y_i - \bar{y})^2\]
The residual variance is: \[s^2 = \frac{\text{SSE}}{n - 2} = \frac{(1 - r^2) \sum (y_i - \bar{y})^2}{n - 2}\] So: \[s = \sqrt{\frac{(1 - r^2) \sum (y_i - \bar{y})^2}{n - 2}}\]
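The identity \(\text{SSE} = (1 - r^2)\,\text{SST}\) can also be checked numerically (a sketch on the sons' data):

fit <- lm(son ~ father, data = Galton_sons)
SSE <- sum(residuals(fit)^2) # unexplained sum of squares
SST <- sum((Galton_sons$son - mean(Galton_sons$son))^2) # total sum of squares
r2 <- cor(Galton_sons$son, Galton_sons$father)^2
all.equal(SSE, (1 - r2) * SST)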
Step 6: Substitute into the t-Statistic
Recall: \[t = \frac{b_1 \sqrt{\sum (x_i - \bar{x})^2}}{s}\]
Substitute \(b_1 = r \frac{s_y}{s_x}\), where \(s_y = \sqrt{\sum (y_i - \bar{y})^2 / (n-1)}\), \(s_x = \sqrt{\sum (x_i - \bar{x})^2 / (n-1)}\):
\[b_1 = r \frac{\sqrt{\sum (y_i - \bar{y})^2 / (n-1)}}{\sqrt{\sum (x_i - \bar{x})^2 / (n-1)}} = r \sqrt{\frac{\sum (y_i - \bar{y})^2}{\sum (x_i - \bar{x})^2}}\]
Now compute: \[b_1 \sqrt{\sum (x_i - \bar{x})^2} = r \sqrt{\frac{\sum (y_i - \bar{y})^2}{\sum (x_i - \bar{x})^2}} \cdot \sqrt{\sum (x_i - \bar{x})^2} = r \sqrt{\sum (y_i - \bar{y})^2}\]
The standard error term involves: \[s = \sqrt{\frac{(1 - r^2) \sum (y_i - \bar{y})^2}{n - 2}}\]
So: \[t = \frac{r \sqrt{\sum (y_i - \bar{y})^2}}{\sqrt{\frac{(1 - r^2) \sum (y_i - \bar{y})^2}{n - 2}}} = \frac{r \sqrt{\sum (y_i - \bar{y})^2} \cdot \sqrt{n - 2}}{\sqrt{(1 - r^2) \sum (y_i - \bar{y})^2}}\]
The \(\sqrt{\sum (y_i - \bar{y})^2}\) terms cancel out: \[t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}\]
Step 7: Linking to Ratio of Variances
Our inquiry started with a “ratio of variances.” In the context of the t-test, the t-statistic can be interpreted through the lens of explained versus unexplained variance. From the regression perspective: \[r^2 = \frac{\text{SSR}}{\text{SST}} = \frac{\text{Explained Variance}}{\text{Total Variance}}\] The unexplained variance is: \[1 - r^2 = \frac{\text{SSE}}{\text{SST}}\] The t-statistic can be related to the F-statistic for regression, where: \[F = \frac{\text{SSR}/1}{\text{SSE}/(n-2)} = \frac{r^2 / 1}{(1 - r^2)/(n-2)}\] For simple linear regression, the t-statistic for the slope is the square root of the F-statistic: \[t^2 = F\] Let’s compute: \[F = \frac{r^2 (n - 2)}{1 - r^2}\] \[t = \sqrt{F} = \sqrt{\frac{r^2 (n - 2)}{1 - r^2}} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}\] This matches our derived t-statistic, confirming that the ratio of explained to unexplained variance underpins the test.
Final Answer
The t-value for testing the significance of the Pearson correlation coefficient \(r\), derived from the perspective of variances in a regression framework, is:
\[t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}\] This formula arises from the ratio of explained to unexplained variance in the regression model, where \(r^2\) represents the proportion of variance in \(Y\) explained by \(X\), and \(1 - r^2\) represents the unexplained variance, adjusted by the degrees of freedom \(n - 2\).
\[\Large{\boxed{t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}}}\]
On to the computations!
< WORK IN PROGRESS >
A. Data Dimensions
[1] 465 6
Sons Data: n = 465
B. Variances
[1] 6.925288
Variance of Sons’ Heights
[1] 5.289674
Variance of Fathers’ Heights
C. Covariance
[1] 2.368441
Covariance of Sons’ and Fathers’ Heights
E. Estimate
[1] 0.3913174
Correlation Estimate for Sons
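The chunks that produced the numbers above are folded; a sketch of steps A–E, assuming the Galton_sons split defined earlier, would be:

n <- nrow(Galton_sons) # A. Data dimensions
var_son <- var(Galton_sons$son) # B. Variance of sons' heights
var_father <- var(Galton_sons$father) # B. Variance of fathers' heights
cov_sf <- cov(Galton_sons$son, Galton_sons$father) # C. Covariance
r_son <- cov_sf / sqrt(var_son * var_father) # E. Correlation estimate
r_son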
F. t-statistics
We can now compute the t-statistic using the formula:
\[
t_{statistic} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}
\]
We can also look up the critical value of t from the t-distribution with \(df = 463\) at a probability of error of \(5\%\):
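A sketch of this step, reusing r_son and n from the previous sketch (assumed names):

t_statistic <- r_son * sqrt(n - 2) / sqrt(1 - r_son^2) # observed t
t_critical <- qt(p = 0.975, df = n - 2) # two-sided 5% critical value
t_statistic
t_critical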
[1] 9.149788
t-statistic for Sons
[1] 1.965101
Critical t value for Sons
G. p-value
Finally we can compute the p-value for this t-statistic:
(p_value_son <- 2.0 * (1 - pt(q = t_statistic, df = 465 - 2)))
[1] 0
We see that the p-value is very small, and we can reject the null hypothesis of "no correlation" between son and father heights.
H. Plotting the t-distribution
Show the Code
mosaic::xqt(
p = c(0.025, 0.975), df = 463,
return = c("plot"), alpha = 0.5,
colour = "black", system = "gg"
) %>%
gf_vline(xintercept = t_statistic, color = "red") %>%
gf_vline(xintercept = t_critical, color = "blue") %>%
gf_annotate(
geom = "label", x = t_statistic, y = 0.3,
label = "t-statistic", colour = "purple", size = 8
) %>%
gf_annotate(
geom = "label", x = t_critical, y = 0.3,
label = "t-critical", colour = "purple", size = 8
) %>%
gf_labs(
title = "t-distribution with df = 463",
subtitle = "Sons' Heights vs Fathers' Heights",
x = "t-value", y = "Density"
) %>%
gf_refine(
scale_y_continuous(expand = expansion(mult = c(0, 0.1))),
scale_x_continuous(
breaks = c(-3, -2, -1, 0, 1, 2, 3, t_statistic, t_critical, 10),
limits = c(-4, 10),
labels = scales::number_format(accuracy = 0.01)
)
)
###
df_corr <- cor_son_pearson$parameter
mean_value <- cor_son_pearson$estimate
gf_fun(dt(
x = (x - mean_value) / sqrt(df_corr / (df_corr - 2)),
df = df_corr
) * (1 / sqrt(df_corr / (df_corr - 2))) ~ x, xlim = c(-5, 5)) %>%
gf_labs(
title = "t-Distribution with Non-Zero Mean",
subtitle = "Sons' Heights vs Fathers' Heights",
x = "Correlation Estimate", y = "Density"
) %>%
gf_vline(xintercept = mean_value, color = "red") %>%
gf_annotate(
geom = "label", x = mean_value, y = 0.3,
label = "Corr Estimate", colour = "purple", size = 8
) %>%
gf_vline(xintercept = cor_son_pearson$conf.low, color = "blue", linetype = "dashed") %>%
gf_vline(xintercept = cor_son_pearson$conf.high, color = "blue", linetype = "dashed") %>%
gf_annotate("segment",
x = cor_son_pearson$conf.low,
xend = cor_son_pearson$conf.high, y = 0.25, yend = 0.25,
arrow = arrow(ends = "both", length = unit(0.2, "inches")), color = "blue", size = 1
)
We can of course use a randomization-based test for correlation. How would we mechanize this; what aspect would we randomize?
Correlation is calculated on a vector basis: each individual observation of variable#1 is multiplied by the corresponding observation of variable#2. Look at Equation 1! So we might randomize the order of one variable relative to the other and recompute the correlation many times, to see how uncommon our observed set of products is. That would give us a p-value with which to decide whether the observed correlation could have arisen by chance. So, onwards with our friend mosaic:
obs_daughter_corr <- cor(Galton_daughters$father, Galton_daughters$daughter)
obs_daughter_corr
[1] 0.4587605
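The permutation itself is folded in the original; a minimal sketch, assuming mosaic::do() with shuffle() plus broom::tidy() so that an estimate column is available, might be:

set.seed(2022) # arbitrary seed, for reproducibility
corr_daughter_null <- mosaic::do(4999) * {
  cor.test(~ daughter + shuffle(father), data = Galton_daughters) %>%
    broom::tidy()
}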
##
p_value_null <- 2.0 * mean(corr_daughter_null$estimate >= obs_daughter_corr)
p_value_null
[1] 0
We see that with all permutations of father, we are never able to hit the actual obs_daughter_corr! Hence there is a definite correlation between father height and daughter height.
4 Correlation as a Linear Model

The premise here is that many common statistical tests are special cases of the linear model. A linear model estimates the relationship between a dependent or "response" variable, height, and an explanatory variable or "predictor", father. It is assumed that the relationship is linear: \(\beta_0\) is the intercept and \(\beta_1\) is the slope of the linear fit that predicts the value of height based on the value of father.
\[ height = \beta_0 + \beta_1 \times father \] The model for Pearson Correlation tests is exactly the Linear Model:
\[
\begin{aligned}
height &= \beta_0 + \beta_1 \times father\\
H_0: \text{Null Hypothesis} &\Rightarrow \beta_1 = 0\\
H_a: \text{Alternate Hypothesis} &\Rightarrow \beta_1 \ne 0\\
\end{aligned}
\]
Using the linear model method we get:
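The model chunk is folded in the original; a sketch with broom-tidied output (assumed object names) might be:

lm_son <- lm(son ~ father, data = Galton_sons)
lm_daughter <- lm(daughter ~ father, data = Galton_daughters)
broom::tidy(lm_son, conf.int = TRUE)
broom::tidy(lm_daughter, conf.int = TRUE)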
Why are the respective \(r\)-s and \(\beta_1\)-s different, though the p-value-s are suspiciously the same!? Did we miss a factor of \(\frac{sd(son/daughter)}{sd(father)} = ??\) somewhere…??
Let us standardize the variables into z-scores (subtract the mean and divide by the standard deviation) and re-do the Linear Model with scaled versions of height and father:
Now you're talking!! The estimate is the same in both the classical test and the linear model! So we conclude:
When both target and predictor have the same standard deviation, the slope from the linear model and the Pearson correlation are the same.
There is this relationship between the slope in the linear model and Pearson correlation:
\[ Slope\ \beta_1 = \frac{sd_y}{sd_x} * r \]
The slope is usually much more interpretable and informative than the correlation coefficient.
- Hence a linear model using scale() for both variables will show slope = r:
Slope_Scaled: 0.4587605 = Correlation: 0.4587605
- Finally, the p-value for the Pearson Correlation and that for the slope in the linear model are identical, and vanishingly small. This means we can reject the NULL hypothesis of "no relationship" between daughters' and fathers' heights.
Can you complete this for the sons?
5 Wait, But Why?
Correlation tests are useful to understand the relationship between two variables, but they do not imply causation. A high correlation does not mean that one variable causes the other to change. It is essential to consider the context and other factors that may influence the relationship.
Correlations also become an important thing to evaluate in Linear Regression.
6 Conclusion
Correlation tests are a powerful way to understand the relationship between two variables. They can be performed using classical methods like Pearson and Spearman correlation, or using more robust methods like permutation tests. The linear model approach provides a deeper understanding of the relationship, especially when the assumptions of normality and homoscedasticity are met.
7 Your Turn
- Try the datasets in the infer package. Use data(package = "infer") in your Console to list out the data packages. Then simply type the name of the dataset in a Quarto chunk (e.g. babynames) to read it.
- Do the same with the resampledata and resampledata3 packages.
8 References
- Jonas Kristoffer Lindeløv, Common statistical tests are linear models (or: how to teach stats). CheatSheet.
- Steve Doogue, Common statistical tests are linear models: a work through.
- Jeffrey Walker, Elements of Statistical Modeling for Experimental Biology.
- Diez, David M., Barr, Christopher D., & Çetinkaya-Rundel, Mine, OpenIntro Statistics.
- Måns Thulin, Modern Statistics with R: From wrangling and exploring data to inference and predictive modelling.
- Jeffrey Walker, "A linear model can be fit to data with continuous, discrete, or categorical x variables".
- https://grok.com/share/c2hhcmQtNA%3D%3D_5a4873eb-9c38-4ee2-a6dc-de1505c415d5
Citation
@online{v.2022,
author = {V., Arvind},
title = {Inference for {Correlation}},
date = {2022-11-25},
url = {https://madhatterguide.netlify.app/content/courses/Analytics/20-Inference/Modules/150-Correlation/},
langid = {en},
abstract = {Statistical Significance Tests for Correlations between
two Variables}
}