Summaries

Throwing away data to grasp it

Arvind V.

2023-10-15

“Love is like quicksilver in the hand. Leave the fingers open and it stays. Clutch it, and it darts away.”

— Dorothy Parker, author (22 Aug 1893-1967)

Setting up R Packages

library(tidyverse)
library(mosaic) # Our all-in-one package
library(skimr) # Looking at data
library(janitor) # Clean the data
library(naniar) # Handle missing data
library(visdat) # Visualise missing data
library(tinytable) # Printing Static Tables for our data
library(DT) # Interactive Tables for our data
library(crosstable) # Multiple variable summaries

Plot Fonts and Theme

Code
library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    # theme(panel.widths = unit(11, "cm"),
    #       panel.heights = unit(6.79, "cm")) + # Golden Ratio

    theme(
      plot.margin = margin_auto(t = 1, r = 2, b = 1, l = 1, unit = "cm"),
      plot.background = element_rect(
        fill = "bisque",
        colour = "black",
        linewidth = 1
      )
    ) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 10
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 8
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

How do we Grasp Data?

We spoke of Experiments and Data Gathering in the first module, Nature of Data. This helped us to obtain data. Then we learnt to Inspect Data to get a feel of the data, and to understand what the variables meant. We also cleaned up the data and arrived at a freshly minted dataset, ready for analysis.

However, despite this inspection, understanding, and cleaning, the actual data remains hard to comprehend in its entirety. Anything more than a handful of observations in a dataset is enough to require other ways of grasping it.

The first thing we need to do, therefore, is to reduce it to a few salient numbers that allow us to summarize the data.

Reduction is Addition

Such a reduction may seem paradoxical but is one of the important tenets of statistics: reduction, while taking away information, ends up adding to insight.

Stephen Stigler (2016) is the author of the book “The Seven Pillars of Statistical Wisdom”. One of the Big Ideas in Statistics from that book is Aggregation:

The first pillar I will call Aggregation, although it could just as well be given the nineteenth-century name, “The Combination of Observations,” or even reduced to the simplest example, taking a mean. Those simple names are misleading, in that I refer to an idea that is now old but was truly revolutionary in an earlier day—and it still is so today, whenever it reaches into a new area of application. How is it revolutionary? By stipulating that, given a number of observations, you can actually gain information by throwing information away! In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.

Throwing Away Data with Brad Pitt

Let us get some inspiration from Brad Pitt, from the movie Moneyball, which is about applying Data Analytics to the game of baseball.

Literacy in the USA

And then, an example from a more sombre story:

Table 1: US Adults Literacy and Numeracy Skills

Year                             Below Level #1   Level #1   Level #2   Level #3   Levels #4 and #5
Number in millions (2012/2014)   8.35             26.5       65.1       71.4       26.6
Number in millions (2017)        7.59             29.2       66.1       68.8       26.7

SOURCE: U.S. Department of Education, National Center for Education Statistics, Program for the International Assessment of Adult Competencies (PIAAC), U.S. PIAAC 2017, U.S. PIAAC 2012/2014.

This ghastly-looking Table 1 depicts U.S. adults with low English literacy and numeracy skills—or low-skilled adults—at two points in the 2010s, in the years 2012/2014 and 2017, using data from the Program for the International Assessment of Adult Competencies (PIAAC). As can be seen, the summary numbers are quite surprising in absolute terms for a developed country like the US, and the count of adults at the lowest skill levels has, overall, increased from 2012/2014 to 2017!

Why Summarize?

So why do we need to summarise data? Summarization is an act of throwing away data to make more sense of it, as stated by Stigler (2016), and also in the movie by Brad Pitt, aka Billy Beane.

To summarize is to understand.

Add to that the fact that our Working Memory can hold maybe 7 items, so summarizing aids information retention too.

It is also a means of registering surprise: some of our first Questions about the data arise from an inspection of data summaries.

And if we don’t summarise?

Jorge Luis Borges, in a fantasy short story published in 1942, titled “Funes the Memorious,” described a man, Ireneo Funes, who found after an accident that he could remember absolutely everything. He could reconstruct every day in the smallest detail, and he could even later reconstruct the reconstruction, but he was incapable of understanding. Borges wrote, “To think is to forget details, generalize, make abstractions. In the teeming world of Funes, there were only details.” (emphasis mine)

Aggregation can yield great gains above the individual components in data. Funes was Big Data without Summary Statistics.

What graphs / numbers will we see today?

Variable #1   Variable #2   Chart Names                “Chart Shape”
All           All           Tables and Stat Measures

What are Summaries?

Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:

  • Dataset length: How many rows/observations?
  • Dataset breadth: How many columns/variables?
  • How many Quant variables?
  • How many Qual variables?
  • Quant variables: min, max, mean, median, sd
  • Qual variables: levels, counts per level
  • Both: means, medians for each level of a Qual variable…
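
Most of these numbers are one function call away in R. Here is a minimal sketch using the built-in mtcars dataset (our case-study data comes later):

```{r}
# Length and breadth of the dataset
nrow(mtcars) # how many observations (rows)?
ncol(mtcars) # how many variables (columns)?

# A Quant variable: min, max, mean, median
summary(mtcars$mpg)
sd(mtcars$mpg) # standard deviation

# A Qual-like variable: counts per level
table(mtcars$cyl)
```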

How do these Summaries Work?

Quant Variable Summaries

Quant variables: Inspecting the base::min, base::max, mean, median, variance and sd of each of the Quant variables tells us straightaway what the ranges of the variables are, and if there are some outliers, which could be normal, or maybe due to data entry error!

Comparing two Quant variables for their ranges also tells us that we may have to scale/normalize them for computational ease, if one variable has large numbers and the other has very small ones.
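
For instance, here is a minimal sketch of standardization, scaling each variable to mean 0 and standard deviation 1; the toy variables big and small are hypothetical:

```{r}
# Two hypothetical variables on very different scales
toy <- tibble(
  big = c(15000, 23000, 41000), # large numbers
  small = c(0.19, 0.22, 0.45) # small numbers
)

# Standardize: subtract the mean, divide by the sd
toy %>%
  mutate(
    big_z = as.numeric(scale(big)),
    small_z = as.numeric(scale(small))
  )
```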

Qual Variable Summaries

Qual variables: With these, we understand the levels within each variable, and the total number of combinations of levels across them.

Counts across levels, and across combinations of levels, tell us whether the data has sufficient readings for graphing, inference, and decision-making, or if certain levels/classes of data are under- or over-represented.

Joint Summaries

Together?: We can use Quant and Qual variables together to develop the above summaries (min, max, mean, median, and sd) for Quant variables, again across levels, and across combinations of levels, of single or multiple Quals, along with counts.

This will tell us if our (sample) dataset already shows quantitative differences between sub-classes in the population.

Simpson’s Paradox, Missing Data, and Imputation

And this may also tell us if we are witnessing a Simpson’s Paradox situation, where a trend that appears within every group reverses or vanishes when the groups are combined. You may have to decide on what to do with this data sparseness, or just check your biases!

For both types of variables, we need to keep an eye open for data entries that are missing! This may point to data gathering errors, which may be fixable. Or we will have to take a decision to let go of that entire observation (i.e. a row).

Or even do what is called imputation: filling in values based on the other values in the same column. That sounds like we are making up data, but it really isn’t so.
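
As a minimal sketch of one simple imputation strategy, here is median imputation with tidyverse tools (the toy data is hypothetical; fancier methods exist):

```{r}
# A hypothetical dataset with missing entries
toy_missing <- tibble(
  age = c(23, 35, NA, 41),
  visits = c(1, 2, 2, NA)
)

# Fill each numeric column's NAs with that column's median
toy_missing %>%
  mutate(across(
    where(is.numeric),
    ~ tidyr::replace_na(.x, median(.x, na.rm = TRUE))
  ))
```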

Some Quick Summary Definitions

Mean

The sample mean, or average, of a Quantitative data variable can be calculated as the sum of the observed values divided by the number of observations:

\[ \large{mean = \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}} \]

Variance and Standard Deviation

Observations can be on either side of the mean, naturally. To measure the extent of these differences, we square and sum the differences between individual values and their mean, and take their average to obtain the (sample) variance:

\[ \large{variance = s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n-1}} \]

The standard deviation \(s\) is just the square root of the variance. (The \(n-1\) is a mathematical nuance to allow for the fact that we have used the data to calculate the mean before we get to \(s^2\), and hence have “used up” one degree of freedom in the data. It gives us a less biased estimate of the population variance.)

Median

When the observations in a Quant variable are placed in order of their magnitude (i.e. rank), the observation in the middle is the median.

Half the observations are below, and half are above, the median.
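
We can check these definitions against R’s built-in functions on a small toy vector:

```{r}
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x) # same as sum(x) / length(x)
sum(x) / length(x)

var(x) # squared deviations from the mean, divided by (n - 1)
sum((x - mean(x))^2) / (length(x) - 1)

sd(x) # square root of the variance
sqrt(var(x))

median(x) # the middle of the sorted observations
```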

Case Study: DocVisits

We will (again) use this superb repository of datasets created by Vincent Arel-Bundock. Let us choose a modest-sized dataset, say this dataset on Doctor Visits, which is available online here and read it into R. We will clean it, munge it, and prepare it in one shot with everything we learnt in the Inspect Data module.

Read the Data

Table 2: Doctor Visits Dataset
docVisits <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv")
Rows: 5190 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, private, freepoor, freerepat, nchronic, lchronic
dbl (7): rownames, visits, age, income, illness, reduced, health

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(docVisits)
Rows: 5,190
Columns: 13
$ rownames  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ visits    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ gender    <chr> "female", "female", "male", "male", "male", "female", "femal…
$ age       <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness   <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health    <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …
$ private   <chr> "yes", "yes", "no", "no", "no", "no", "no", "no", "yes", "ye…
$ freepoor  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ freerepat <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ nchronic  <chr> "no", "no", "no", "no", "yes", "yes", "no", "no", "no", "no"…
$ lchronic  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

So 5190 rows and 13 columns. Several variables are of class character (gender, private, freepoor, freerepat, nchronic, lchronic) and several are double (visits, age, income, illness, reduced, health).

Data Cleaning and Munging

We will first clean the data, and then modify it to make it ready for analysis.

Code
docVisits_modified <- docVisits %>%
  # Replace common NA strings and numbers with actual NA
  naniar::replace_with_na_all(condition = ~ .x %in% common_na_strings) %>%
  naniar::replace_with_na_all(condition = ~ .x %in% common_na_numbers) %>%
  # Clean variable names
  janitor::clean_names(case = "snake") %>% # clean names

  # Convert character variables to factors
  mutate(
    gender = as_factor(gender),
    private = as_factor(private),
    freepoor = as_factor(freepoor),
    freerepat = as_factor(freerepat),
    nchronic = as_factor(nchronic),
    lchronic = as_factor(lchronic)
  ) %>%
  # arrange the character variables first
  dplyr::relocate(where(is.factor), .after = rownames)


docVisits_modified %>% glimpse()
Rows: 5,190
Columns: 13
$ rownames  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ gender    <fct> female, female, male, male, male, female, female, female, fe…
$ private   <fct> yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, no, …
$ freepoor  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ freerepat <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no, no, no,…
$ nchronic  <fct> no, no, no, no, yes, yes, no, no, no, no, no, no, yes, yes, …
$ lchronic  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ visits    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ age       <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness   <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health    <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …

Final Clean Data Table

Code
docVisits_modified %>%
  DT::datatable(
    caption = htmltools::tags$caption(
      style = "caption-side: top; text-align: left; color: black; font-size: 150%;",
      "Doctor Visits Dataset (Clean)"
    ),
    options = list(pageLength = 10, autoWidth = TRUE)
  ) %>%
  DT::formatStyle(
    columns = names(docVisits_modified),
    fontFamily = "Roboto Condensed",
    fontSize = "12px"
  )
Table 3: Doctor Visits Dataset (Clean)

Data Dictionary

We can set up the Data Dictionary from the website describing the data:

Variable Description
visits Number of doctor visits in past 2 weeks.
gender Factor indicating gender.
age Age in years divided by 100.
income Annual income in tens of thousands of dollars.
illness Number of illnesses in past 2 weeks.
reduced Number of days of reduced activity in past 2 weeks due to illness or injury.
health General health questionnaire score using Goldberg’s method.
private Factor. Does the individual have private health insurance?
freepoor Factor. Does the individual have free government health insurance due to low income?
freerepat Factor. Does the individual have free government health insurance due to old age, disability or veteran status?
nchronic Factor. Is there a chronic condition not limiting activity?
lchronic Factor. Is there a chronic condition limiting activity?

Summarise the Data

We now proceed to extract summary statistics from the data. We will first work with individual variables, and then use sensible combinations to summarize with, based on our understanding of the variables involved.

We will use:

  • Overall view: skimr::skim(), mosaic::inspect(), and dplyr::glimpse()
  • Qual variables: dplyr::count()
  • Quant variables: dplyr::summarise()
  • Both together: dplyr::group_by() + dplyr::summarize(); and crosstable::crosstable()

to develop our intuitions.

Overall View of Data
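
A minimal sketch with the three overall-view functions listed above; each gives a slightly different flavour of the same bird’s-eye view:

```{r}
docVisits_modified %>% dplyr::glimpse() # rows, columns, types, first few values
docVisits_modified %>% skimr::skim() # per-variable summary statistics
docVisits_modified %>% mosaic::inspect() # Qual and Quant variables, separately
```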

Summarise Qual Variables
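
With dplyr::count() we can tally the levels of a single Qual variable, and combinations of levels across several:

```{r}
# Counts per level of one Qual variable
docVisits_modified %>% count(gender)

# Counts across combinations of levels of two Qual variables
docVisits_modified %>% count(gender, private, sort = TRUE)
```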

Summarise Quant Variables

How about summaries for Quant variables?
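
With dplyr::summarise() we can compute the stat measures we defined earlier, for one or more Quant variables at a time:

```{r}
docVisits_modified %>%
  summarise(
    mean_visits = mean(visits),
    median_visits = median(visits),
    sd_visits = sd(visits),
    min_income = min(income),
    max_income = max(income),
    count = n()
  )
```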

Grouped Summaries

Why Grouped Summaries?

We saw that we could obtain numerical summary stats such as means, medians, quartiles, and maxima/minima of entire Quantitative variables, i.e. the complete column. However, we often need identical numerical summary stats for parts of a Quantitative variable. Why?

Note that we have Qualitative variables as well in a typical dataset. These Qual variables help us to group the entire dataset based on their combinations of levels. We can now think of summarizing Quant variables within each such group. This will give us an idea whether different segments of the population, as defined by Qual variables and their levels, are relatively similar, or if there are significant differences between groups.

Creating Group Summaries
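
A minimal sketch: dplyr::group_by() + summarise() for one grouping variable, and a crosstable() call (using its tidyselect interface) for several Quant variables over a Qual variable:

```{r}
# Quant summaries within each level of a Qual variable
docVisits_modified %>%
  group_by(gender) %>%
  summarise(
    mean_visits = mean(visits),
    mean_income = mean(income),
    count = n()
  )

# The same idea with crosstable: visits and income, split by gender
crosstable(docVisits_modified,
  cols = c(visits, income),
  by = gender
)
```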

Summaries and Uncertainty

So, are we sure these summaries speak the truth? Are they accurate? Do they really represent a truth about the population from which this data sample was drawn?

We will need to deal with these ideas when we get to Inferential Statistics. For now, we will just note that the summaries we have obtained are sample statistics. We will need to perform some analysis to understand how well these sample statistics represent the corresponding population parameters.

More on dplyr

The {dplyr} package is capable of doing much more than just count, group_by and summarize. We will encounter this package many times more as we build our intuition about data visualization. A full tutorial on {dplyr} is here:

dplyr Tutorial

Your Turn

  1. Star Trek Books
  2. Math Anxiety! Hah!
  3. Cardio Data Sets
  4. Neuro Data Sets
  5. Datasets from the Lock5 Textbook (Pruim 2015; Lock 2021)

Note

Which would be the Group By variables here? And what would you summarize? With which function?

Note

```{r}
library(CardioDataSets)
data(package = "CardioDataSets") # Lists datasets in the package
```
```{r}
library(NeuroDataSets)
data(package = "NeuroDataSets") # Lists datasets in the package
```
```{r}
library(Lock5Data)
library(Lock5withR)
data(package = "Lock5Data") # Lists datasets in the package
data(package = "Lock5withR") # Lists datasets in the package
```

Wait, But Why?

  • Data Summaries give you the essentials, without getting bogged down in the details (just yet).
  • Summaries help you “live with your data”; this is an important step in understanding it, and deciding what to do with it.
  • Summaries help evoke Questions and Hypotheses about the population, which may lead to inquiries, analysis, and insights
  • Grouped Summaries should tell you if:
    • counts of groups in your target audience are lopsided/imbalanced; Go and Get your data again.
    • there are visible differences in Quant data across groups, so your target audience could be nicely fractured;
    • etc.

Conclusion

  • mosaic::inspect(), skimr::skim() and dplyr::glimpse() give us an overall summary of our data.
  • Using dplyr::count() we can get counts of levels of Qual variables, and combinations of levels of multiple Qual variables.
  • With dplyr::summarise() we can get summary statistics of Quant variables, singly or in pairs, or even all together.
  • Using dplyr::group_by() we can group the data by levels of one or more Qual variables, and then use dplyr::summarise() to get summary statistics of Quant variables within each group.
  • crosstable::crosstable() can also be used to get grouped summaries of multiple Quant variables over Qual variables, using the formula interface.

Make these part of your Workflow.

AI Generated Summary and Podcast

This is a tutorial on using the R programming language to perform descriptive statistical analysis on data sets. The tutorial focuses on summarizing data using various R packages like {dplyr} and {crosstable}. It emphasizes the importance of understanding the data’s structure, identifying different types of variables (qualitative and quantitative), and calculating summary statistics such as means, medians, and frequencies. The tutorial provides examples using real datasets and highlights the significance of data summaries in gaining initial insights, formulating research questions, and identifying potential issues with the data.

References

  1. Lock, Lock, Lock, Lock, and Lock. (2021). Statistics: Unlocking the Power of Data, 3rd Edition. https://media.wiley.com/product_data/excerpt/69/11196821/1119682169-32.pdf

R Package Citations

Package Version Citation
CardioDataSets 0.2.0 Caceres Rossi (2025a)
crosstable 0.8.2 Chaltiel (2025)
janitor 2.2.1 Firke (2024)
Lock5Data 3.0.0 Lock (2021)
Lock5withR 1.2.2 Pruim (2015)
mosaic 1.9.2 Pruim, Kaplan, and Horton (2017)
NeuroDataSets 0.2.0 Caceres Rossi (2025b)
skimr 2.2.1 Waring et al. (2025)
Caceres Rossi, Renzo. 2025a. CardioDataSets: A Comprehensive Collection of Cardiovascular and Heart Disease Datasets. https://github.com/lightbluetitan/cardiodatasets.
———. 2025b. NeuroDataSets: A Comprehensive Collection of Neuroscience and Brain-Related Datasets. https://github.com/lightbluetitan/neurodatasets.
Chaltiel, Dan. 2025. crosstable: Crosstables for Descriptive Analyses. https://doi.org/10.32614/CRAN.package.crosstable.
Firke, Sam. 2024. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://doi.org/10.32614/CRAN.package.janitor.
Lock, Robin. 2021. Lock5Data: Datasets for Statistics: UnLocking the Power of Data. https://doi.org/10.32614/CRAN.package.Lock5Data.
Pruim, Randall. 2015. Lock5withR: Datasets for Statistics: Unlocking the Power of Data. https://github.com/rpruim/Lock5withR.
Pruim, Randall, Daniel T Kaplan, and Nicholas J Horton. 2017. “The Mosaic Package: Helping Students to Think with Data Using r.” The R Journal 9 (1): 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.
Stigler, Stephen M. 2016. “The Seven Pillars of Statistical Wisdom,” March. https://doi.org/10.4159/9780674970199.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2025. skimr: Compact and Flexible Summaries of Data. https://doi.org/10.32614/CRAN.package.skimr.