library(tidyverse)        # Sine Qua Non
library(mosaic)           # Our bag of tricks
library(broom)            # Tidying model outputs
library(crosstable)       # tabulated summary stats
library(openintro)        # datasets and methods
library(resampledata3)    # datasets
library(mosaicData)       # datasets
library(statsExpressions) # datasets and methods
library(ggstatsplot)      # special stats plots
library(ggExtra)
# Non-CRAN Packages
# remotes::install_github("easystats/easystats")
library(easystats)
Plot Fonts and Theme
Code
library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)
sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +
    # theme(panel.widths = unit(11, "cm"),
    #       panel.heights = unit(6.79, "cm")) + # Golden Ratio
    theme(
      plot.margin = margin_auto(t = 1, r = 2, b = 1, l = 1, unit = "cm"),
      plot.background = element_rect(
        fill = "bisque",
        colour = "black",
        linewidth = 1
      )
    ) +
    theme_sub_axis(
      title = element_text(family = "Roboto Condensed", size = 10),
      text = element_text(family = "Roboto Condensed", size = 8)
    ) +
    theme_sub_legend(
      text = element_text(family = "Roboto Condensed", size = 6),
      title = element_text(family = "Alegreya", size = 8)
    ) +
    theme_sub_plot(
      title = element_text(family = "Alegreya", size = 14, face = "bold"),
      title.position = "plot",
      subtitle = element_text(family = "Alegreya", size = 10),
      caption = element_text(family = "Alegreya", size = 6),
      caption.position = "plot"
    )
}
## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"
))
## Set the theme
ggplot2::theme_set(new = theme_custom())
## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)
## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")
Introduction
Correlations describe how one variable varies with another. One of the basic questions we would have of our data is: does some variable have a significant correlation score with another? Does \(y\) vary with \(x\)? A Correlation Test is designed to answer exactly this question. The block diagram below depicts the statistical procedures available to test for the significance of correlation scores between two variables.
In this module we will explore the correlation coefficient and how to test for its significance. We will also see how to use the linear model method to perform correlation tests, and how to use the permutation test to do so without any assumptions.
Basic Definitions
Before we begin, let us recap a few basic definitions:
We have already encountered the variance of a variable:
\[
\begin{align*}
var_x &= \frac{\sum_{i=1}^{n}(x_i - \mu_x)^2}{(n-1)}\\
\text{where } \mu_x &= \text{mean}(x)\\
n &= \text{sample size}
\end{align*}
\] The standard deviation is:
\[
\sigma_x = \sqrt{var_x}
\] The covariance of two variables is defined as:
\[
cov_{xy} = \frac{\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)}{(n-1)}
\] and the correlation coefficient is this covariance scaled by the two standard deviations:
\[
r_{xy} = \frac{cov_{xy}}{\sigma_x \sigma_y} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \mu_x}{\sigma_x}\right)\left(\frac{y_i - \mu_y}{\sigma_y}\right) \tag{1}
\]
Note that in both cases we are dealing with deviations of each variable from its mean; in the correlation these become z-scores, \(\frac{x_i - \mu_x}{\sigma_x}\), which we have seen when dealing with the CLT and the Gaussian Distribution.
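As a quick numerical illustration of this z-score view (with a few made-up numbers of our own, not from any dataset used later), the correlation is just the scaled sum of products of z-scores:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
sum(scale(x) * scale(y)) / (length(x) - 1) # correlation as an average product of z-scores
cor(x, y)                                  # same value from the built-in function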
Case Study #1: Galton’s famous dataset
How can we start, except by using the famous Galton dataset, now part of the mosaicData package?
Workflow: Read and Inspect the Data
data("Galton", package = "mosaicData")
Galton
skimr::skim(Galton)
Data summary

|                                |        |
|--------------------------------|--------|
| Name                           | Galton |
| Number of rows                 | 898    |
| Number of columns              | 6      |
| Column type frequency: factor  | 2      |
| Column type frequency: numeric | 4      |
| Group variables                | None   |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---------------|-----------|---------------|---------|----------|------------|
| family        | 0         | 1             | FALSE   | 197      | 185: 15, 166: 11, 66: 11, 130: 10 |
| sex           | 0         | 1             | FALSE   | 2        | M: 465, F: 433 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean  | sd   | p0 | p25 | p50  | p75  | p100 | hist  |
|---------------|-----------|---------------|-------|------|----|-----|------|------|------|-------|
| father        | 0         | 1             | 69.23 | 2.47 | 62 | 68  | 69.0 | 71.0 | 78.5 | ▁▅▇▂▁ |
| mother        | 0         | 1             | 64.08 | 2.31 | 58 | 63  | 64.0 | 65.5 | 70.5 | ▂▅▇▃▁ |
| height        | 0         | 1             | 66.76 | 3.58 | 56 | 64  | 66.5 | 69.7 | 79.0 | ▁▇▇▅▁ |
| nkids         | 0         | 1             | 6.14  | 2.69 | 1  | 4   | 6.0  | 8.0  | 15.0 | ▃▇▆▂▁ |
So there are several correlations we can explore here: Children’s height vs that of father or mother, based on sex. In essence we are replicating Francis Galton’s famous study.
Data Munging
Note that nkids, while coded as int, is actually a factor variable. Let us convert it to a factor:
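A minimal sketch of that conversion (assuming we simply overwrite the column in place):

Galton <- Galton %>% mutate(nkids = as_factor(nkids))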
We can say that Galton may have measured the heights of fathers/mothers and their children, and recorded the sex of the child. He may have been interested in knowing if there is a correlation between the heights of the parents and their children, and if this correlation is different for sons and daughters. Hence the children’s height is the response variable, and the father’s/mother’s heights are explanatory variables, as is sex of the child.
Question 1
Based on this sample, what can we say about the correlation between a son’s height and a father’s height in the population?
Question 2
Based on this sample, what can we say about the correlation between a daughter’s height and a father’s height in the population?
Of course we can formulate more questions, but these are good for now! And since we are going to infer correlations by sex, let us split the dataset into two parts, one for the sons and one for the daughters, and quickly summarise them too:
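The split-and-summarise code is not shown here; a sketch of how it might look, using the object names Galton_sons and Galton_daughters (and the renamed son / daughter columns) that appear in the plots below:

Galton_sons <- Galton %>%
  filter(sex == "M") %>%
  rename(son = height)
Galton_daughters <- Galton %>%
  filter(sex == "F") %>%
  rename(daughter = height)
##
Galton_sons %>% summarise(across(c(son, father), list(mean = mean, sd = sd)))
Galton_daughters %>% summarise(across(c(daughter, father), list(mean = mean, sd = sd)))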
Why are father means different for sons and daughters?? When we filtered the dataset into two, the filtering by sex of the child also effectively filtered the heights of the father (and mother). This is proper and desired; but think!
Workflow: Visualization
Let us first quickly plot a graph that is relevant to each of the two research questions.
Code
ggplot2::theme_set(new = theme_custom())
Galton_sons %>%
  gf_point(son ~ father) %>%
  gf_lm() %>%
  gf_labs(
    x = "Father's Height", y = "Son's Height",
    title = "Heights: Sons vs Fathers",
    subtitle = "Galton dataset"
  )
##
Galton_daughters %>%
  gf_point(daughter ~ father) %>%
  gf_lm() %>%
  gf_labs(
    x = "Father's Height", y = "Daughter's Height",
    title = "Heights: Daughters vs Fathers",
    subtitle = "Galton dataset"
  )
(a) Sons
(b) Daughters
Figure 1: Heights of Sons and Daughters vs Fathers
We might even plot the overall heights together and colour by sex of the child:
ggplot2::theme_set(new = theme_custom())
Galton %>%
  gf_point(height ~ father,
    group = ~sex, colour = ~sex
  ) %>%
  gf_lm() %>%
  gf_refine(scale_color_brewer(palette = "Set1")) %>%
  gf_labs(
    x = "Father's Height", y = "Children's Height",
    title = "Heights: Children vs Fathers",
    subtitle = "Galton dataset"
  )
So daughters are shorter than sons, generally speaking, and both sets of heights seem related to that of the father.
Workflow: Assumptions
For the classical correlation tests, we need the variables to be normally distributed. As before, we check this visually and with the shapiro.test:
ggplot2::theme_set(new = theme_custom())
Galton %>%
  group_by(sex) %>%
  gf_density(~father,
    group = ~sex, # no this is not weird
    fill = ~sex
  ) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_refine(scale_fill_brewer(name = "Sex of Child", palette = "Set1")) %>%
  gf_facet_grid(vars(sex)) %>%
  gf_labs(
    title = "Fathers: Facetted Density Plots",
    subtitle = "By Sex of Child"
  ) %>%
  gf_theme(legend.position = "none") # Think!
Galton %>%
  group_by(sex) %>%
  gf_qq(~father,
    group = ~sex, # no this is not weird
    colour = ~sex, size = 0.5
  ) %>%
  gf_qqline(colour = "black") %>%
  gf_facet_grid(vars(sex)) %>%
  gf_refine(scale_colour_brewer(name = "Sex of Child", palette = "Set1")) %>%
  gf_labs(
    title = "Fathers Heights: Facetted QQ Plots",
    subtitle = "By Sex of Child",
    x = "Theoretical quartiles",
    y = "Actual Data"
  ) %>%
  gf_theme(legend.position = "none") # Think!
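The shapiro.test call itself is not shown above; a minimal sketch of that check (the exact grouping used here is an assumption) might be:

# Shapiro-Wilk test for children's heights, by sex of child
Galton %>%
  group_by(sex) %>%
  summarise(shapiro_p_height = shapiro.test(height)$p.value)
# ...and for the fathers' heights
shapiro.test(Galton$father)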
The shapiro.test informs us that the child-related height variables are not normally distributed; though visually there seems nothing much to complain about. Hmmm…
Dads are weird anyway, so we must not expect father heights to be normally distributed.
Workflow: Inference
Let us now see how Correlation Tests can be performed based on this dataset, to infer patterns in the population from which this dataset/sample was drawn.
We will go with classical tests first, and then set up a permutation test that does not need any assumptions.
We perform the Pearson correlation test first, even though the data are not normal and this test is strictly not appropriate here. We then use a non-parametric alternative as well: the Spearman correlation test.
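A sketch of these tests using mosaic::cor_test; the object name cor_son_pearson matches the one used later in the t-distribution plots, while the other names are assumptions:

cor_son_pearson <- mosaic::cor_test(son ~ father, data = Galton_sons, method = "pearson") %>%
  broom::tidy()
cor_son_pearson
##
cor_daughter_pearson <- mosaic::cor_test(daughter ~ father, data = Galton_daughters, method = "pearson") %>%
  broom::tidy()
cor_daughter_spearman <- mosaic::cor_test(daughter ~ father, data = Galton_daughters, method = "spearman") %>%
  broom::tidy()
cor_daughter_pearson
cor_daughter_spearman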
Again both tests state that the correlation between daughter and father is significant.
What is happening under the hood in cor.test?
Given that the correlation coefficient \(r\) is a measure of the linear relationship between two variables, we can test its significance using a t-test. The formula for the t-value in a correlation test is derived from the relationship between the correlation coefficient and the t-distribution.
The formula for the t-value is given by: \[t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \tag{2}\] where:
\(t\) is the t-statistic,
\(r\) is the Pearson correlation coefficient,
\(n\) is the number of paired observations (sample size).
The degrees of freedom for this t-test is \(df = n - 2\).
The derivation of this t-statistic stems from the fact that the correlation coefficient can be expressed in terms of the slope of the regression line when one variable is regressed on the other. The t-test essentially tests whether the slope of this regression line is significantly different from zero, which would indicate a significant linear relationship between the two variables.
In Linear Regression, we saw that the F-statistic is a ratio of variances, and that it follows an F-distribution. How does this relate to correlation?
In regression we have a target variable (sons' heights) and a predictor variable (fathers' heights), same as with our present study of correlation.
We look at how much one variable explains, or reduces the variance of, the other;
i.e. how much the variance of the target variable (sons' heights) is reduced by the fact that we know the value(s) of a predictor variable (fathers' heights).
Our measure of how well this reduction is happening is a ratio: a ratio of variances, also denoted as \(r^2\), i.e. the square of the correlation coefficient.
We take the variance of the target variable, and the variance of the target variable after we have accounted for the predictor variable.
The ratio of variances follows an F-distribution with appropriate degrees of freedom
This gives us our F-value, computed from our data.
We compare our computed F-value with the critical F-value, F-crit, for a probability of error of \(5\%\), given by the F-distribution with appropriate degrees of freedom.
If our F-value is well above F-crit, we state that there is a very low probability (i.e. the p-value) that this reduction happened simply by chance, and accept the hypothesis that there is significant correlation between the two variables.
We relate the idea of Regression to Correlation by noting that the t-value for correlation must be simply the square root of the F-value for regression, since the F-distribution is for \(r^2\) and the t-distribution is for \(r\).
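As a concrete check of this \(t^2 = F\) equivalence, a quick sketch on the sons' data (no new objects beyond those above):

# F-statistic from the regression ANOVA table
anova(lm(son ~ father, data = Galton_sons))
# t-statistic from the correlation test; its square should equal the F above
mosaic::cor_test(son ~ father, data = Galton_sons)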
A Distribution for a Variance-Ratio??
Why does the variance-ratio have a distribution??? The two variances appear to be single numbers !!! Is there anything that is not random in statistics?
Remember, we treat our data as a sample from a population, in order to estimate the regression slope for the population.
The sample is random, and hence the variance of the sample is also random.
If we took another sample, we would get a different variance.
So the variance of a sample is a random variable, and hence the ratio-of-variances also has a distribution. Phew!
The F-statistic and its Distribution
Why does the ratio-of-variances have an F-distribution?
In our case, our residuals (deviations from the means) are assumed to be normal.
The variance calculation squares these normally-distributed residuals, leading to a chi-square distribution for the individual variances.
The ratio of these two independent chi-squared variables, each divided by their respective degrees of freedom, follows an F-distribution.
This is a fundamental result in statistics that underpins the use of the F-test in statistical analysis.
To derive the formula for the t-value in a correlation test, starting with a ratio of variances, we need to focus on the t-test for the significance of the Pearson correlation coefficient. The t-value assesses whether the observed correlation coefficient \(r\) is significantly different from zero. Let’s proceed step-by-step, connecting the t-test to variances and ensuring a clear derivation.
Step 1: Understanding the Pearson Correlation Coefficient
The Pearson correlation coefficient \(r\) measures the linear relationship between two variables \(X\) and \(Y\). It is defined as:
\[r = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{3}\]
where: \(\text{Cov}(X, Y)\) is the covariance of \(X\) and \(Y\), \(\text{Var}(X)\) and \(\text{Var}(Y)\) are the variances of \(X\) and \(Y\), \(x_i\) and \(y_i\) are the data points, \(\bar{x}\) and \(\bar{y}\) are the means, and \(n\) is the sample size.
This formula shows \(r\) as a standardized measure of covariance relative to the product of standard deviations (square roots of variances).
Step 2: Hypothesis Testing for Correlation
To test whether the correlation is significantly different from zero, we use a t-test. The null hypothesis (\(H_0\)) is that the population correlation coefficient \(\rho = 0\), and the alternative hypothesis (\(H_a\)) is \(\rho \neq 0\) (for a two-tailed test). The t-statistic for testing the significance of \(r\) is commonly given as: \[t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}\]
This t-statistic follows a t-distribution with \(n - 2\) degrees of freedom under the null hypothesis. Our goal is to derive this formula, starting from a perspective involving variances.
Step 3: Connecting to Variances via Linear Regression
The t-test for the correlation coefficient is closely related to the t-test for the slope of a linear regression model. Suppose we regress \(Y\) on \(X\): \[Y = \beta_0 + \beta_1 X + \epsilon\] The slope \(\beta_1\) estimates the change in \(Y\) per unit change in \(X\). The sample slope \(b_1\) is: \[b_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\] Notice the relationship between \(b_1\) and \(r\). Since: \[r = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}\]
We can express the covariance as: \[\text{Cov}(X, Y) = r \sqrt{\text{Var}(X) \text{Var}(Y)}\] So: \[
b_1 = \frac{r \sqrt{\text{Var}(X) \text{Var}(Y)}}{\text{Var}(X)} = r \sqrt{\frac{\text{Var}(Y)}{\text{Var}(X)}} = r \frac{s_y}{s_x}
\tag{4}\]
where \(s_x = \sqrt{\text{Var}(X)} = \sqrt{\sum (x_i - \bar{x})^2 / (n-1)}\) and \(s_y = \sqrt{\text{Var}(Y)}\) are the sample standard deviations.
Step 4: t-Test for the Regression Slope
To test \(H_0: \beta_1 = 0\), we use the t-statistic for the slope:
\[t = \frac{b_1}{\text{SE}(b_1)}\] where \(\text{SE}(b_1)\) is the standard error of the slope. The variance of the slope estimate is derived from the regression model. Assuming the errors \(\epsilon\) are normally distributed with variance \(\sigma^2\), the variance of \(b_1\) is:
\[\text{Var}(b_1) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}\] The standard error is: \[\text{SE}(b_1) = \sqrt{\frac{\sigma^2}{\sum (x_i - \bar{x})^2}}\] Since \(\sigma^2\) is unknown, we estimate it with the residual variance \(s^2\): \[s^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2}\] where \(\hat{y}_i = \bar{y} + b_1 (x_i - \bar{x})\) are the fitted values, and \(n - 2\) accounts for the degrees of freedom (two parameters estimated: \(\beta_0\) and \(\beta_1\)). Thus: \[\text{SE}(b_1) = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}\] So the t-statistic is: \[t = \frac{b_1}{\text{SE}(b_1)} = \frac{b_1 \sqrt{\sum (x_i - \bar{x})^2}}{s}\]
The residual sum of squares is: \[\sum (y_i - \hat{y}_i)^2 = \sum (y_i - \bar{y} - b_1 (x_i - \bar{x}))^2\]
To connect this to \(r\), consider the proportion of variance explained by the regression. The coefficient of determination \(r^2\) (for simple linear regression, this is the square of the correlation coefficient) is: \[r^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}\] where:
\(\text{SSR} = \sum (\hat{y}_i - \bar{y})^2\) (sum of squares due to regression),
\(\text{SSE} = \sum (y_i - \hat{y}_i)^2\) (sum of squares of errors),
\(\text{SST} = \sum (y_i - \bar{y})^2\) (total sum of squares).
Since \(\hat{y}_i - \bar{y} = b_1 (x_i - \bar{x})\), we have: \[\text{SSR} = \sum (\hat{y}_i - \bar{y})^2 = b_1^2 \sum (x_i - \bar{x})^2, \qquad r^2 = \frac{\text{SSR}}{\text{SST}} = \frac{b_1^2 \sum (x_i - \bar{x})^2}{\sum (y_i - \bar{y})^2}\]
Our inquiry started with a “ratio of variances.” In the context of the t-test, the t-statistic can be interpreted through the lens of explained versus unexplained variance. From the regression perspective: \[r^2 = \frac{\text{SSR}}{\text{SST}} = \frac{\text{Explained Variance}}{\text{Total Variance}}\] The unexplained variance is: \[1 - r^2 = \frac{\text{SSE}}{\text{SST}}\] The t-statistic can be related to the F-statistic for regression, where: \[F = \frac{\text{SSR}/1}{\text{SSE}/(n-2)} = \frac{r^2 / 1}{(1 - r^2)/(n-2)}\] For simple linear regression, the t-statistic for the slope is the square root of the F-statistic: \[t^2 = F\] Let’s compute: \[F = \frac{r^2 (n - 2)}{1 - r^2}\]\[t = \sqrt{F} = \sqrt{\frac{r^2 (n - 2)}{1 - r^2}} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}\] This matches our derived t-statistic, confirming that the ratio of explained to unexplained variance underpins the test.
Final Answer
The t-value for testing the significance of the Pearson correlation coefficient \(r\), derived from the perspective of variances in a regression framework, is:
\[t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}\] This formula arises from the ratio of explained to unexplained variance in the regression model, where \(r^2\) represents the proportion of variance in \(Y\) explained by \(X\), and \(1 - r^2\) represents the unexplained variance, adjusted by the degrees of freedom \(n - 2\).
We can now compute the t-statistic using the formula: \[
t_{statistic} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}
\] We can look up the critical value of t from the t-distribution with \(df = 463\) at a probability of error of \(5\%\):
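A sketch of this computation; the names t_statistic and t_critical are the ones used in the plotting code below, and cor_son_pearson is the tidied Pearson test from earlier:

r <- cor_son_pearson$estimate            # Pearson r for sons vs fathers
n <- nrow(Galton_sons)                   # 465 sons, so df = n - 2 = 463
t_statistic <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_critical <- qt(p = 0.975, df = n - 2)  # two-sided test, 5% probability of error
t_statistic
t_critical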
We see that the p-value is very small, and we can reject the null hypothesis of “no correlation” between son and father heights.
H. Plotting the t-distribution
Code
mosaic::xqt(
  p = c(0.025, 0.975), df = 463,
  return = c("plot"), alpha = 0.5,
  colour = "black", system = "gg"
) %>%
  gf_vline(xintercept = t_statistic, color = "red") %>%
  gf_vline(xintercept = t_critical, color = "blue") %>%
  gf_annotate(
    geom = "label", x = t_statistic, y = 0.3,
    label = "t-statistic", colour = "purple", size = 8
  ) %>%
  gf_annotate(
    geom = "label", x = t_critical, y = 0.3,
    label = "t-critical", colour = "purple", size = 8
  ) %>%
  gf_labs(
    title = "t-distribution with df = 463",
    subtitle = "Sons' Heights vs Fathers' Heights",
    x = "t-value", y = "Density"
  ) %>%
  gf_refine(
    scale_y_continuous(expand = expansion(mult = c(0, 0.1))),
    scale_x_continuous(
      breaks = c(-3, -2, -1, 0, 1, 2, 3, t_statistic, t_critical, 10),
      limits = c(-4, 10),
      labels = scales::number_format(accuracy = 0.01)
    )
  )
###
df_corr <- cor_son_pearson$parameter
mean_value <- cor_son_pearson$estimate
gf_fun(
  dt(x = (x - mean_value) / sqrt(df_corr / (df_corr - 2)), df = df_corr) *
    (1 / sqrt(df_corr / (df_corr - 2))) ~ x,
  xlim = c(-5, 5)
) %>%
  gf_labs(
    title = "t-Distribution with Non-Zero Mean",
    subtitle = "Sons' Heights vs Fathers' Heights",
    x = "Correlation Estimate", y = "Density"
  ) %>%
  gf_vline(xintercept = mean_value, color = "red") %>%
  gf_annotate(
    geom = "label", x = mean_value, y = 0.3,
    label = "Corr Estimate", colour = "purple", size = 8
  ) %>%
  gf_vline(xintercept = cor_son_pearson$conf.low, color = "blue", linetype = "dashed") %>%
  gf_vline(xintercept = cor_son_pearson$conf.high, color = "blue", linetype = "dashed") %>%
  gf_annotate("segment",
    x = cor_son_pearson$conf.low,
    xend = cor_son_pearson$conf.high, y = 0.25, yend = 0.25,
    arrow = arrow(ends = "both", length = unit(0.2, "inches")),
    color = "blue", size = 1
  )
(a) t-statistic and t-critical marked
(b) Estimate and Confidence Intervals
Figure 2: t-distribution with df = 463
We can of course use a randomization-based test for correlation. How would we mechanize this; what aspect would we randomize?
Correlation is calculated on a vector basis: each individual observation of variable #1 is multiplied by the corresponding observation of variable #2. Look at Equation 1! So we might be able to randomize the order of this multiplication to see how uncommon this particular set of multiplications is. That would give us a p-value to decide if the observed correlation is close to the truth. So, onwards with our friend mosaic:
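A sketch of such a permutation test (the number of shuffles, 4999, and the object names are choices made here, not taken from the original code):

obs_daughter_corr <- cor(daughter ~ father, data = Galton_daughters)
obs_daughter_corr
## shuffle father many times and recompute the correlation each time
null_daughter <- mosaic::do(4999) * cor(daughter ~ shuffle(father), data = Galton_daughters)
gf_histogram(~cor, data = null_daughter) %>%
  gf_vline(xintercept = obs_daughter_corr, colour = "red")
## proportion of shuffled correlations at least as large as the observed one
prop(~ cor >= obs_daughter_corr, data = null_daughter)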
We see that with all permutations of father, we are never able to hit the actual obs_daughter_corr! Hence there is a definite correlation between father height and daughter height.
The premise here is that many common statistical tests are special cases of the linear model. A linear model estimates the relationship between a dependent or "response" variable, height, and an explanatory variable or "predictor", father. It is assumed that the relationship is linear. \(\beta_0\) is the intercept and \(\beta_1\) is the slope of the linear fit, which predicts the value of height based on the value of father.
\[
height = \beta_0 + \beta_1 \times father
\] The model for Pearson Correlation tests is exactly the Linear Model:
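A sketch of that linear model for the daughters (the object name is an assumption):

lin_galton_daughters <- lm(daughter ~ 1 + father, data = Galton_daughters) %>%
  broom::tidy() %>%
  mutate(term = c("beta_0", "beta_1"))
lin_galton_daughters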
Why are the respective \(r\)-s and \(\beta_1\)-s different, though the p-values are suspiciously the same!? Did we miss a factor of \(\frac{sd(son/daughter)}{sd(father)} = ??\) somewhere…??
Let us scale the variables to z-scores (subtract the mean and divide by the sd) and re-do the Linear Model with scaled versions of daughter and father:
# Scaled linear model
lin_scaled_galton_daughters <- lm(scale(daughter) ~ 1 + scale(father), data = Galton_daughters) %>%
  broom::tidy() %>%
  mutate(term = c("beta_0", "beta_1"))
lin_scaled_galton_daughters
Now you’re talking!! The estimate is the same in both the classical test and the linear model! So we conclude:
When both target and predictor have the same standard deviation, the slope from the linear model and the Pearson correlation are the same.
There is this relationship between the slope in the linear model and Pearson correlation:
\[
Slope\ \beta_1 = \frac{sd_y}{sd_x} * r
\]
The slope is usually much more interpretable and informative than the correlation coefficient.
Hence a linear model using scale() for both variables will show slope = r.
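A sketch of this comparison, whose output appears below (the exact code and formatting of the original are not preserved):

# Compare the slope from the scaled linear model with the plain Pearson correlation
slope_scaled <- lin_scaled_galton_daughters %>%
  filter(term == "beta_1") %>%
  pull(estimate)
corr_daughters <- cor(Galton_daughters$daughter, Galton_daughters$father)
cat("Slope_Scaled:", slope_scaled, "= Correlation:", corr_daughters)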
Slope_Scaled: 0.4587605 = Correlation: 0.4587605
Finally, the p-value for Pearson Correlation and that for the slope in the linear model is the same (\(0.04280043\)). Since this is below \(0.05\), we can reject the NULL hypothesis of "no relationship" between daughters' and fathers' heights.
Can you complete this for the sons?
Case Study #2: Study and Grades
In some cases the LINE assumptions may not hold.
Nonlinear relationships, non-normally distributed data (with large outliers), and working with ordinal rather than continuous data: these situations necessitate the use of Spearman's ranked correlation scores. (Ranked, not sign-ranked.)
See the example below: We choose to look at the gpa_study_hours dataset. It has two numeric columns gpa and study_hours:
ggplot2::theme_set(new = theme_custom())
ggplot(gpa_study_hours, aes(x = study_hours, y = gpa)) +
  geom_point() +
  geom_smooth() +
  labs(
    title = "GPA vs Study Hours",
    subtitle = "Pearson Correlation Test"
  )
Hmm… not normally distributed, and there is a sort of increasing relationship; however, is it linear? And there is some evidence of heteroscedasticity, so the LINE assumptions are clearly in violation. Pearson correlation would not be the best idea here.
Let us quickly try it anyway, using a Linear Model for the scaled gpa and study_hours variables, from where we get:
# Pearson Correlation as Linear Model
model_gpa <- lm(scale(gpa) ~ 1 + scale(study_hours), data = gpa_study_hours)
##
model_gpa %>%
  broom::tidy() %>%
  mutate(term = c("beta_0", "beta_1")) %>%
  cbind(confint(model_gpa) %>% as_tibble()) %>%
  select(term, estimate, p.value, `2.5 %`, `97.5 %`)
The correlation estimate is \(0.133\); the p-value is \(0.065\) (and the confidence interval includes \(0\)).
Hence we fail to reject the NULL hypothesis that study_hours and gpa have no relationship. But can this be right?
Should we use another test, that does not need the LINE assumptions?
“Signed Rank” Values
Most statistical tests use the actual values of the data variables. However, in some non-parametric statistical tests, the data are used in rank-transformed sense/order. (In some cases the signed-rank of the data values is used instead of the data itself.)
Signed Rank is calculated as follows:
Take the absolute value of each observation in a sample
Rank these absolute values in order of magnitude; the smallest number has rank = 1, and so on. This gives us ranked data.
Give each of the ranks the sign of the original observation (+ or -). This gives us signed-rank data.
signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}
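For example, with a few made-up numbers:

signed_rank(c(-2.5, 1.2, -0.3, 4.0))
## the absolute values 2.5, 1.2, 0.3, 4.0 have ranks 3, 2, 1, 4;
## restoring the original signs gives: -3  2 -1  4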
Plotting Original and Signed Rank Data
Let us see how this might work by comparing data and its signed-rank version…A quick set of plots:
So the means of the ranks of the three separate variables seem to be in the same order as the means of the data variables themselves.
How about associations between data? Do ranks reflect well what the data might?
The slopes are almost identical, \(0.25\) for both original data and ranked data for \(y1\sim x\). So maybe ranked and even sign_ranked data could work, and if it can work despite LINE assumptions not being satisfied, that would be nice!
How does Sign-Rank data work?
TBD: need to add some explanation here.
Spearman correlation = Pearson correlation using the rank of the data observations. Let's check how this holds for our x and y1 data:
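Since the x / y1 toy-data chunk is not reproduced here, here is the same check sketched on the Galton daughters data instead:

cor(Galton_daughters$daughter, Galton_daughters$father, method = "spearman")
cor(rank(Galton_daughters$daughter), rank(Galton_daughters$father), method = "pearson")
# ...and the slope of a linear model fitted on the ranks is essentially the same number
lm(rank(daughter) ~ rank(father), data = Galton_daughters) %>% broom::tidy()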
When ranks are used, the slope of the linear model (\(\beta_1\)) has the same value as the Spearman correlation coefficient ( \(\rho\) ).
Note that the slope from the linear model now has an intuitive interpretation: the number of ranks \(y\) changes for each unit change in the rank of \(x\). (Ranks are "independent" of sd.)
Example
We examine the cars93 data, where the numeric variables of interest are weight and price.
ggplot2::theme_set(new = theme_custom())
cars93 %>%
  ggplot(aes(weight, price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, lty = 2) +
  labs(title = "Car Weight and Car Price have a nonlinear relationship") +
  theme_classic()
Let us try a Spearman Correlation score for these variables, since the data are not linearly related and the variance of price is not constant over weight.
# Using linear Model
lm(rank(price) ~ rank(weight), data = cars93) %>%
  summary()
Call:
lm(formula = rank(price) ~ rank(weight), data = cars93)
Residuals:
Min 1Q Median 3Q Max
-20.0676 -3.0135 0.7815 3.6926 20.4099
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.22074 2.05894 1.564 0.124
rank(weight) 0.88288 0.06514 13.554 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.46 on 52 degrees of freedom
Multiple R-squared: 0.7794, Adjusted R-squared: 0.7751
F-statistic: 183.7 on 1 and 52 DF, p-value: < 2.2e-16
# Stats Plot
ggstatsplot::ggscatterstats(
  data = cars93, x = weight, y = price,
  type = "nonparametric",
  title = "Cars93: Weight vs Price",
  subtitle = "Spearman Correlation"
)
We see that using ranks of the price and weight variables, we obtain a Spearman's \(\rho = 0.882\) with a very small p-value. Hence we are able to reject the NULL hypothesis and state that there is a relationship between these two variables. The strength of this (monotonic) relationship is evaluated as a rank correlation of 0.882.
# Other ways using other packages
mosaic::cor_test(gpa ~ study_hours, data = gpa_study_hours) %>%
  broom::tidy() %>%
  select(estimate, p.value, conf.low, conf.high)
Correlation tests are useful to understand the relationship between two variables, but they do not imply causation. A high correlation does not mean that one variable causes the other to change. It is essential to consider the context and other factors that may influence the relationship.
Correlations also become an important thing to evaluate in Linear Regression.
Conclusion
Correlation tests are a powerful way to understand the relationship between two variables. They can be performed using classical methods like Pearson and Spearman correlation, or using more robust methods like permutation tests. The linear model approach provides a deeper understanding of the relationship, especially when the assumptions of normality and homoscedasticity are met.
Your Turn
Try the datasets in the infer package. Use data(package = "infer") in your Console to list out the data packages. Then simply type the name of the dataset in a Quarto chunk (e.g. babynames) to read it.
Same with the resampledata and resampledata3 packages.
Çetinkaya-Rundel, Mine, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno, and Christopher Barr. 2024. openintro: Datasets and Supplemental Functions from “OpenIntro” Textbooks and Labs. https://doi.org/10.32614/CRAN.package.openintro.
Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Brenton M. Wiernik, Etienne Bacher, Rémi Thériault, and Dominique Makowski. 2022. “easystats: Framework for Easy Statistical Modeling, Visualization, and Reporting.” CRAN. https://doi.org/10.32614/CRAN.package.easystats.
Patil, Indrajeet. 2021a. “statsExpressions: R Package for Tidy Dataframes and Expressions with Statistical Details.” Journal of Open Source Software 6 (61): 3236. https://doi.org/10.21105/joss.03236.
———. 2021b. “Visualizations with statistical details: The ‘ggstatsplot’ approach.” Journal of Open Source Software 6 (61): 3167. https://doi.org/10.21105/joss.03167.