Inference for a Single Mean

“The more I love humanity in general, the less I love man in particular.” ― Fyodor Dostoyevsky, The Brothers Karamazov

Arvind V.

2022-11-10

…neither let us despair over how small our successes are. For however much our successes fall short of our desire, our efforts aren’t in vain when we are farther along today than yesterday.

— John Calvin

Setting up R packages

library(tidyverse)
library(mosaic)
library(ggformula)
library(infer)
library(broom) # Clean test results in tibble form
library(resampledata) # Datasets from Chihara and Hesterberg's book
library(openintro) # More datasets
library(visStatistics) # One package to rule them all
library(ggstatsplot)

Plot Fonts and Theme

library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    # theme(panel.widths = unit(11, "cm"),
    #       panel.heights = unit(6.79, "cm")) + # Golden Ratio

    theme(
      plot.margin = margin_auto(t = 1, r = 2, b = 1, l = 1, unit = "cm"),
      plot.background = element_rect(
        fill = "bisque",
        colour = "black",
        linewidth = 1
      )
    ) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 10
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 8
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

Introduction

In this module, we will answer a basic Question: What is the mean \(\mu\) of the population?

Recall that the mean is the first of our Summary Statistics. We wish to know more about the mean of the population from which we have drawn our data sample.

We will do this in several ways, based on the assumptions we are willing to adopt about our data. First we will use a toy dataset with one “imaginary” sample, normally distributed and made up of 50 observations. Since we “know the answer” we will be able to build up some belief in the tests and procedures, which we will dig into to form our intuitions.

We will then use a real-world dataset to make inferences on the means of Quant variables therein, and decide what that could tell us.

Statistical Inference is almost an Attitude!

As we will notice, the process of Statistical Inference is an attitude: ain’t nothing happenin’! We look at data that we might have received or collected ourselves, and look at it with this attitude, seemingly, of some disbelief. Here is how we proceed:

  1. We state our belief that:

    • there is really nothing happening with our Research Question, and there is no reason to believe in it;
    • the sample is just one of many, and the value/statistic indicated by the sample is the outcome of random chance (NULL Belief / Hypothesis).
  2. We then calculate how slim the chances (Relative Probability! Remember we are Frequentists!) are of the given data sample / statistic showing up like that, given our NULL belief. It is a distance measurement of sorts, literally embodying the idea of far-fetched-ness!!

  3. If the chances of the sample are too low, then that might alter our belief. This is the attitude that lies at the heart of Null Hypothesis Significance Testing (NHST).

Important

The calculation of chances is both logical and feasible, since we are dealing with samples from a population and can imagine a distribution of sample statistics. If most samples drawn under the NULL belief would give statistics quite different from ours, then our sample is a far-fetched outcome of that belief, and the belief itself comes into question.

Each test we perform will mechanize this attitude in different ways, based on assumptions and conveniences. (And history)
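The three steps above can be sketched as a small simulation. This is only a minimal sketch in base R, not one of the tests used later: the null mean `mu_null`, the population SD `sigma`, and the `observed_mean` are all hypothetical values chosen for illustration.

```r
set.seed(2022) # arbitrary seed, for reproducibility only

# Hypothetical values, chosen only for illustration
mu_null <- 0 # NULL belief: the population mean is 0
sigma <- 2 # assumed population standard deviation
n <- 50 # sample size
observed_mean <- 0.9 # pretend this is the mean of the sample we collected

# Step 1: imagine many samples from a world where the NULL belief holds
null_means <- replicate(10000, mean(rnorm(n, mean = mu_null, sd = sigma)))

# Step 2: how often does chance alone give a mean at least this far from mu_null?
p_value <- mean(abs(null_means - mu_null) >= abs(observed_mean - mu_null))

# Step 3: a very small proportion makes the NULL belief hard to sustain
p_value
```

The proportion `p_value` plays the role of the "distance measurement": the smaller it is, the more far-fetched our sample looks under the NULL belief.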

Case Study #1: Toy data

Since the t-test we will use assumes that the data come from a normally distributed population, let us generate a sample that is just so:

set.seed(42) # for replication
#
# Data as individual vectors
# ( for t.tests etc)
# Generate normally distributed data with mean = 2, sd = 2, length = 50
y <- rnorm(n = 50, mean = 2, sd = 2)

# And as tibble too
mydata <- tibble(y = y)
mydata

Inspecting and Charting Data

## Set Plot theme
ggplot2::theme_set(new = theme_custom())

mydata %>%
  gf_density(~y) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_labs(
    title = "Densities of Original Data Variables",
    subtitle = "Compared with Normal Density"
  )
Figure 1: Toy Data Distribution

Observations from Density Plots

  • The variable \(y\) appears to be centred close to 2, the mean used to generate it
  • It does not seem to be normally distributed…
  • So assumptions are not always valid…
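A few numbers can back up what the eye sees in the density plot. Here is a minimal sketch in base R that regenerates the same toy sample and computes its mean, standard deviation, and standard error; the names `y_mean`, `y_sd` and `y_se` are introduced here purely for illustration.

```r
set.seed(42) # same seed as the toy data above
y <- rnorm(n = 50, mean = 2, sd = 2)

y_mean <- mean(y) # sample mean: our estimate of mu
y_sd <- sd(y) # sample standard deviation
y_se <- y_sd / sqrt(length(y)) # standard error of the sample mean

c(mean = y_mean, sd = y_sd, se = y_se)
```

The standard error is the yardstick we will use shortly: it tells us how much the sample mean would typically wobble from one sample of 50 to the next.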

Research Question

Research Questions are always about the population! Here goes:

Research Question

Could the mean of the population \(\mu\), from which the sample y has been drawn, be \(0\)?

Assumptions

Testing for Normality

The y-variable does not appear to be normally distributed. This would affect the test we can use to make inferences about the population mean.
There are formal tests for normality too. We will do them in the next case study. For now, let us proceed naively.

Inference
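As a hedged sketch of how the Research Question could be put to the toy data, here is a one-sample t-test with base R's `t.test`, against the null value \(\mu = 0\). The by-hand calculation is included only to show what the t statistic measures; the names `t_toy` and `t_by_hand` are introduced here for illustration.

```r
set.seed(42) # same seed as the toy data above
y <- rnorm(n = 50, mean = 2, sd = 2)

# H0: mu = 0, with the default two-sided alternative
t_toy <- t.test(y, mu = 0)
t_toy

# The same t statistic by hand:
# distance of the sample mean from 0, in units of standard error
t_by_hand <- (mean(y) - 0) / (sd(y) / sqrt(length(y)))
t_by_hand
```

Since we generated the data with mean 2, we "know the answer", and a small p-value here is what we would hope to see.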

Case Study #2: Exam data

Let us now choose a dataset from the openintro package:

data("exam_grades", package = "openintro")
exam_grades

Research Question

There are quite a few Quant variables in the data. Let us choose course_grade as our variable of interest. What might we wish to find out?

Research Question

In general, the Teacher in this class is overly generous with grades unlike others we know, and so the average course-grade is equal to 80% !!

Inspecting and Charting Data

ggplot2::theme_set(new = theme_custom())

exam_grades %>%
  gf_density(~course_grade) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_labs(
    title = "Density of Course Grade",
    subtitle = "Compared with Normal Density"
  )
Figure 5: Exam Grades Distribution

Hmm…data looks normally distributed. But this time we will not merely trust our eyes, but do a test for it.

Testing Assumptions in the Data

Is the data normally distributed?

stats::shapiro.test(x = exam_grades$course_grade) %>%
  broom::tidy()

The Shapiro-Wilk test checks whether a data variable is normally distributed. Without digging into the maths of it, let us say it assumes that the variable is normally distributed and then computes how likely a sample departing this much from normality would be under that assumption. So a high p-value (\(0.47\)) is a good thing here: we have no evidence against normality.
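To build some intuition for how the p-value behaves, here is a minimal sketch that runs `shapiro.test` on two simulated variables: one drawn from a normal distribution and one from a strongly skewed (exponential) distribution. Both variables are made up for illustration and have nothing to do with `exam_grades`.

```r
set.seed(123) # arbitrary seed, for reproducibility only

# Two made-up variables: one normal, one strongly right-skewed
normal_x <- rnorm(200, mean = 70, sd = 10)
skewed_x <- rexp(200, rate = 1 / 10)

p_normal <- shapiro.test(normal_x)$p.value
p_skewed <- shapiro.test(skewed_x)$p.value

# A tiny p-value is evidence against normality; a large one is not
c(normal = p_normal, skewed = p_skewed)
```

The skewed variable should give a p-value very close to zero, while the normal one has no particular reason to.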

When we have large Quant variables (i.e. with more than 5000 observations), the function shapiro.test does not work, and we use an Anderson-Darling test to check normality:

library(nortest)
# Especially when we have > 5000 observations
nortest::ad.test(x = exam_grades$course_grade) %>%
  broom::tidy()

So course_grade is a normally-distributed variable. There are no exceptional students! Hmph! Peasants.
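The size limit of `shapiro.test` is easy to see for ourselves. Here is a minimal sketch with a simulated variable of 6000 observations (`big_x`, made up for illustration): we catch the error from `shapiro.test` and fall back to `nortest::ad.test` only if the nortest package happens to be installed.

```r
set.seed(1) # arbitrary seed, for reproducibility only
big_x <- rnorm(6000) # made-up variable, larger than shapiro.test allows

# shapiro.test accepts between 3 and 5000 observations, so this call errors
res <- tryCatch(shapiro.test(big_x), error = function(e) e)
shapiro_failed <- inherits(res, "error")
if (shapiro_failed) message("shapiro.test refused: ", conditionMessage(res))

# Fall back to Anderson-Darling, assuming the nortest package is available
if (shapiro_failed && requireNamespace("nortest", quietly = TRUE)) {
  print(nortest::ad.test(big_x))
}
```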

Inference

Workflow for Inference for a Single Mean

A series of tests deal with one mean value of a sample. The idea is to evaluate whether that mean is representative of the mean of the underlying population. Depending upon the nature of the (single) variable, the tests that can be used are as follows:

flowchart TD
    A[Inference for Single Mean] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n or\n Anderson-Darling Test]
    B --> C{OK?}
    C -->|Yes\n Parametric| D[t.test]
    C -->|No\n Non-Parametric| E[wilcox.test]
    E <--> G[t.test\n with\n Signed-Ranks of Data]
    C -->|No\n Non-Parametric| P[Bootstrap]
    C -->|No\n Non-Parametric| Q[Permutation]
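The first two branches of the flowchart can be sketched as a small helper. This is a hypothetical function (`test_single_mean` is not part of any package), using only base R, and it is tried on made-up data rather than on `exam_grades`.

```r
# Hypothetical helper following the flowchart: check normality, then pick a test
test_single_mean <- function(x, mu0, alpha = 0.05) {
  x <- x[!is.na(x)]
  # shapiro.test needs between 3 and 5000 observations
  normal_ok <- shapiro.test(x)$p.value > alpha
  if (normal_ok) {
    t.test(x, mu = mu0) # parametric branch
  } else {
    wilcox.test(x, mu = mu0) # non-parametric branch
  }
}

set.seed(7) # arbitrary seed, for reproducibility only
res_normal <- test_single_mean(rnorm(100, mean = 72, sd = 10), mu0 = 80)
res_skewed <- test_single_mean(rexp(100, rate = 1 / 72), mu0 = 80)

# Which test was chosen in each case?
res_normal$method
res_skewed$method
```

Note that `wilcox.test` makes a statement about the centre (the pseudo-median) of the data rather than the mean, so the two branches answer slightly different questions.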
 

Wait, But Why?

  • We can only sample from a population, and calculate sample statistics
  • But we still want to know about population parameters
  • All our tests and measures of uncertainty with samples are aimed at obtaining a confident measure of a population parameter.
  • Means are the first on the list!

Conclusion

  • If samples are normally distributed, we use a t.test.
  • Else we try non-parametric tests such as the Wilcoxon test.
  • Since we now have compute power at our fingertips, we can leave off considerations of normality and simply proceed with either a permutation or a bootstrap test.
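The last point can be sketched in a few lines of base R. This is a minimal bootstrap sketch under stated assumptions: `grades` is a simulated, hypothetical stand-in for a course-grade variable (its size, mean and spread are arbitrary choices), and the null value `mu0 = 80` echoes the Research Question of Case Study #2. The test uses a shifted bootstrap, where we recentre the sample so that the NULL belief holds and then resample.

```r
set.seed(2022) # arbitrary seed, for reproducibility only

# Hypothetical stand-in data, NOT the real exam_grades
grades <- rnorm(233, mean = 72, sd = 10)
mu0 <- 80 # null value for the population mean

# Bootstrap percentile confidence interval for the mean
boot_means <- replicate(5000, mean(sample(grades, replace = TRUE)))
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

# Bootstrap test: shift the sample so that its mean equals mu0, then resample
shifted <- grades - mean(grades) + mu0
null_means <- replicate(5000, mean(sample(shifted, replace = TRUE)))
p_boot <- mean(abs(null_means - mu0) >= abs(mean(grades) - mu0))

boot_ci
p_boot
```

No normality assumption is needed here: the resampling itself builds the distribution of sample means that the t-test gets from theory.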

References

  1. OpenIntro Modern Statistics, Chapter #17
  2. Bootstrap based Inference using the infer package: https://infer.netlify.app/articles/t_test
  3. Michael Clark & Seth Berry. Models Demystified: A Practical Guide from t-tests to Deep Learning. https://m-clark.github.io/book-of-models/
  4. University of Warwick. SAMPLING: Searching for the Approximation Method used to Perform rational inference by Individuals and Groups. https://sampling.warwick.ac.uk/#Overview

Additional Readings

  1. https://mine-cetinkaya-rundel.github.io/quarto-tip-a-day/posts/21-diagrams/

R Package Citations

Package          Version   Citation
ellmer           0.3.2     Wickham et al. (2025)
infer            1.0.9     Couch et al. (2021)
openintro        2.5.0     Çetinkaya-Rundel et al. (2024)
resampledata     0.3.2     Chihara and Hesterberg (2018)
statlingua       0.1.0     Greenwell (2025)
TeachHist        0.2.1     Lange (2023)
TeachingDemos    2.13      Snow (2024)

Çetinkaya-Rundel, Mine, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno, and Christopher Barr. 2024. openintro: Datasets and Supplemental Functions from OpenIntro Textbooks and Labs. https://doi.org/10.32614/CRAN.package.openintro.
Chihara, Laura M., and Tim C. Hesterberg. 2018. Mathematical Statistics with Resampling and R. John Wiley & Sons, Hoboken, NJ. https://github.com/lchihara/MathStatsResamplingR?tab=readme-ov-file.
Couch, Simon P., Andrew P. Bray, Chester Ismay, Evgeni Chasnovski, Benjamin S. Baumer, and Mine Çetinkaya-Rundel. 2021. “infer: An R Package for Tidyverse-Friendly Statistical Inference.” Journal of Open Source Software 6 (65): 3661. https://doi.org/10.21105/joss.03661.
Greenwell, Brandon M. 2025. statlingua: Explain Statistical Output with Large Language Models. https://doi.org/10.32614/CRAN.package.statlingua.
Lange, Carsten. 2023. TeachHist: A Collection of Amended Histograms Designed for Teaching Statistics. https://doi.org/10.32614/CRAN.package.TeachHist.
Snow, Greg. 2024. TeachingDemos: Demonstrations for Teaching and Learning. https://doi.org/10.32614/CRAN.package.TeachingDemos.
Wickham, Hadley, Joe Cheng, Aaron Jacobs, Garrick Aden-Buie, and Barret Schloerke. 2025. ellmer: Chat with Large Language Models. https://doi.org/10.32614/CRAN.package.ellmer.

An AI Generated Explanation of the Statistical Models

We have used the statlingua package to generate an AI explanation of the statistical model t4 we have used in this module. statlingua interfaces to the ellmer package, which in turn interfaces to the ollama API for AI explanations. The AI model used is llama3.1, and the explanation is tailored for a novice audience with moderate verbosity.

AI Generated Explanation

The command was run in the Console and the results pasted into this Quarto document. I have not edited this at all. Do look for complex verbiage and outright hallucinatory responses!!

library(ellmer)
library(statlingua)
library(chores)
client <- ellmer::chat_ollama(model = "llama3.1", echo = FALSE)
###

exam_context <- "
The model uses a data set on child exam grades from the openintro package in R.
The hypothesis is that the average course grade is 80%. The model includes a t-test and a Wilcoxon signed-rank test to assess whether the mean of the course grades significantly differs from this value.
The variables are:
- `course_grade`: The final grade in the course, which is a numeric variable.
- `semester`: The semester in which the course was taken, which is a factor variable.
- `sex`: Sex of the student ( Man / Woman )
- `exam1`, `exam2`, `exam3`: Scores on three exams, which are numeric variables.
The model also includes a permutation test and a bootstrap test to further assess the hypothesis."

explanation4 <- statlingua::explain(t4,
  client = client,
  context = exam_context,
  audience = "novice", verbosity = "moderate"
) # moderate / detailed

One-Sample t-test Output Interpretation

This is a One-sample t-test output summary from R, comparing the mean course grade (course_grade) against 80%. Key elements are outlined below:

Estimate, Statistic, p-value, and N/A Parameter

  • estimate (72.2):
    • Indicates the estimated sample mean of course_grade, in this case approximately 72.2.
    • This suggests that on average, across all examined children, their final grades are below 80%.
  • statistic (-12.1):
    • The large negative value indicates how many standard deviations away from the hypothesized mean (\(80\)) our observed estimate is; this can help us grasp just how ‘unexpected’ our outcome really is.
    • This large difference would usually correspond to a very small p-value under \(H_0\), which supports rejecting it.
  • p-value (2.19e-26):
    • Extremely low p-values (usually below 0.05 or even .001) indicate that the observed effect or deviation from the expected value is statistically significant.
    • In this context, the hypothesis of 80% being the true mean can be strongly rejected since achieving such a grade as often as observed purely by chance for our sample size is incredibly unlikely.
  • parameter (232):
    • The parameter usually refers to the degrees of freedom used in the calculation of test statistics.
    • However, since we’re looking at one sample t-test results here it likely indicates that there are 231 data points or more as ‘available’. A large such number is expected given that often this statistic simply becomes irrelevant to further inferences about such hypotheses (especially those based on a normal distribution assumption), because for larger n approaching population size all t-statistics should effectively be treated identically and become useless for exact confidence intervals etc..

Confidence Intervals

  • conf.low and conf.high: These provide bounds within which the true mean can be estimated not to lie with 95% probability, given our understanding of the test’s results so far.

Additional Details

  • method: Often ‘One Sample’, as seen in this example. One-sample tests are used when comparing an individual sample or a group of samples against a known population parameter.
  • alternative: This denotes the directionality assumption during hypothesis testing; by default it is set to two-sided for most statistical tests, which means we want to determine whether our result (if any) contradicts H0 in either direction (above or below, greater or smaller). There are one-tailed tests too though which look at differences only above (> or <), not both sides.

Remember, given the small p-value (.0022e+26 being effectively zero), you can reject H0 very safely and conclude that observed exam grade averages definitely differ significantly from what we’d expect if things were as described in our model according to a normal probability distribution. The large t-statistic (-12, indicating substantial deviation) confirms this.