Summaries

Throwing away data to grasp it

Arvind V.

2023-10-15

“Love is like quicksilver in the hand. Leave the fingers open and it stays. Clutch it, and it darts away.”

— Dorothy Parker, author (22 Aug 1893-1967)

Setting up R Packages

library(tidyverse)
library(mosaic) # Our all-in-one package
library(skimr) # Looking at data
library(janitor) # Clean the data
library(naniar) # Handle missing data
library(visdat) # Visualise missing data
library(tinytable) # Printing Static Tables for our data
library(DT) # Interactive Tables for our data
library(crosstable) # Multiple variable summaries

Plot Fonts and Theme

Code
library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    # theme(panel.widths = unit(11, "cm"),
    #       panel.heights = unit(6.79, "cm")) + # Golden Ratio

    theme(
      plot.margin = margin_auto(t = 1, r = 2, b = 1, l = 1, unit = "cm"),
      plot.background = element_rect(
        fill = "bisque",
        colour = "black",
        linewidth = 1
      )
    ) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 10
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 8
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

How do we Grasp Data?

We spoke of Experiments and Data Gathering in the first module, Nature of Data. This helped us to obtain data. Then we learnt to Inspect Data to get a feel of the data, and to understand what the variables meant. We also cleaned up the data and arrived at a freshly minted dataset, ready for analysis.

However, despite this inspection, understanding, and cleaning, the actual data remains hard to comprehend in its entirety. Anything more than a handful of observations in a dataset is enough to require other ways of grasping it.

The first thing we need to do, therefore, is to reduce it to a few salient numbers that allow us to summarize the data.

Reduction is Addition

Such a reduction may seem paradoxical but is one of the important tenets of statistics: reduction, while taking away information, ends up adding to insight.

Stephen Stigler (2016) is the author of the book “The Seven Pillars of Statistical Wisdom”. One of the Big Ideas in Statistics from that book is Aggregation:

The first pillar I will call Aggregation, although it could just as well be given the nineteenth-century name, “The Combination of Observations,” or even reduced to the simplest example, taking a mean. Those simple names are misleading, in that I refer to an idea that is now old but was truly revolutionary in an earlier day—and it still is so today, whenever it reaches into a new area of application. How is it revolutionary? By stipulating that, given a number of observations, you can actually gain information by throwing information away! In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.

Throwing Away Data with Brad Pitt

Let us get some inspiration from Brad Pitt, from the movie Moneyball, which is about applying Data Analytics to the game of baseball.

Literacy in the USA

And then, an example from a more sombre story:

Table 1: US Adults Literacy and Numeracy Skills

Year                             Below Level #1   Level #1   Level #2   Level #3   Levels #4 and #5
Number in millions (2012/2014)   8.35             26.5       65.1       71.4       26.6
Number in millions (2017)        7.59             29.2       66.1       68.8       26.7

SOURCE: U.S. Department of Education, National Center for Education Statistics, Program for the International Assessment of Adult Competencies (PIAAC), U.S. PIAAC 2017, U.S. PIAAC 2012/2014.

This ghastly-looking Table 1 depicts U.S. adults with low English literacy and numeracy skills—or low-skilled adults—at two points in the 2010s, in the years 2012/2014 and 2017, using data from the Program for the International Assessment of Adult Competencies (PIAAC). As can be seen, the summary numbers are quite surprising in absolute terms for a developed country like the US, and the count of adults at the lowest skill levels has, overall, increased from 2012/2014 to 2017!

Why Summarize?

So why do we need to summarise data? Summarization is an act of throwing away data to make more sense of it, as stated by Stigler (2016), and also in the movie by Brad Pitt, aka Billy Beane.

To summarize is to understand.

Add to that the fact that our Working Memory can hold maybe 7 items, so summarizing aids information retention too.

It is also a means of registering surprise: some of our first Questions about the data arise from an inspection of data summaries.

And if we don’t summarise?

Jorge Luis Borges, in a fantasy short story published in 1942, titled “Funes the Memorious,” described a man, Ireneo Funes, who found after an accident that he could remember absolutely everything. He could reconstruct every day in the smallest detail, and he could even later reconstruct the reconstruction, but he was incapable of understanding. Borges wrote, “To think is to forget details, generalize, make abstractions. In the teeming world of Funes, there were only details.” (emphasis mine)

Aggregation can yield great gains above the individual components in data. Funes was Big Data without Summary Statistics.

What graphs / numbers will we see today?

Variable #1   Variable #2   Chart Names                “Chart Shape”
All           All           Tables and Stat Measures

What are Summaries?

Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:

  • Dataset length: How many rows/observations?
  • Dataset breadth: How many columns/variables?
  • How many Quant variables?
  • How many Qual variables?
  • Quant variables: min, max, mean, median, sd
  • Qual variables: levels, counts per level
  • Both: means, medians for each level of a Qual variable…
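
Most of these numbers are one function call away in R. Here is a minimal sketch using the built-in mtcars dataset (our case-study data comes later):

```{r}
# Length and breadth of the dataset
nrow(mtcars) # how many observations (rows)?
ncol(mtcars) # how many variables (columns)?

# A Quant variable: min, max, mean, median
summary(mtcars$mpg)
sd(mtcars$mpg) # standard deviation

# A Qual-like variable: counts per level
table(mtcars$cyl)
```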

How do these Summaries Work?

Quant Variable Summaries

Quant variables: Inspecting the base::min, base::max, mean, median, variance and sd of each of the Quant variables tells us straightaway what the ranges of the variables are, and if there are some outliers, which could be normal, or maybe due to data entry error!

Comparing two Quant variables for their ranges also tells us that we may have to scale/normalize them for computational ease, if one variable has large numbers and the other has very small ones.
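
For instance, here is a minimal sketch of standardization, scaling each variable to mean 0 and standard deviation 1; the toy variables big and small are hypothetical:

```{r}
# Two hypothetical variables on very different scales
toy <- tibble(
  big = c(15000, 23000, 41000), # large numbers
  small = c(0.19, 0.22, 0.45) # small numbers
)

# Standardize: subtract the mean, divide by the sd
toy %>%
  mutate(
    big_z = as.numeric(scale(big)),
    small_z = as.numeric(scale(small))
  )
```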

Qual Variable Summaries

Qual variables: With these, we understand the levels within each variable, and the total number of combinations of levels across them.

Counts across levels, and across combinations of levels, tell us whether the data has sufficient readings for graphing, inference, and decision-making, or if certain levels/classes of data are under- or over-represented.

Joint Summaries

Together?: We can use Quant and Qual variables together to develop the above summaries (min, max, mean, median, and sd) for Quant variables, again across levels, and across combinations of levels, of single or multiple Quals, along with counts.

This will tell us if our (sample) dataset already shows quantitative differences between sub-classes in the population.

Simpson’s Paradox, Missing Data, and Imputation

And this may also tell us if we are witnessing a Simpson’s Paradox situation, where a trend that appears within every group reverses or vanishes when the groups are combined. You may have to decide on what to do with this data sparseness, or just check your biases!

For both types of variables, we need to keep an eye open for data entries that are missing! This may point to data gathering errors, which may be fixable. Or we will have to take a decision to let go of that entire observation (i.e. a row).

Or even do what is called imputation: filling in values based on the other values in the same column. That sounds like we are making up data, but it really isn’t so.
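
As a minimal sketch of one simple imputation strategy, here is median imputation with tidyverse tools (the toy data is hypothetical; fancier methods exist):

```{r}
# A hypothetical dataset with missing entries
toy_missing <- tibble(
  age = c(23, 35, NA, 41),
  visits = c(1, 2, 2, NA)
)

# Fill each numeric column's NAs with that column's median
toy_missing %>%
  mutate(across(
    where(is.numeric),
    ~ tidyr::replace_na(.x, median(.x, na.rm = TRUE))
  ))
```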

Some Quick Summary Definitions

Mean

The sample mean, or average, of a Quantitative data variable can be calculated as the sum of the observed values divided by the number of observations:

\[ \large{mean = \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}} \]

Variance and Standard Deviation

Observations can be on either side of the mean, naturally. To measure the extent of these differences, we square and sum the differences between individual values and their mean, and take their average to obtain the (sample) variance:

\[ \large{variance = s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n-1}} \]

The standard deviation \(s\) is just the square root of the variance. (The \(n-1\) is a mathematical nuance to allow for the fact that we have used the data to calculate the mean before we get to \(s^2\), and hence have “used up” one degree of freedom in the data. It gives us a less biased estimate of the population variance.)

Median

When the observations in a Quant variable are placed in order of their magnitude (i.e. rank), the observation in the middle is the median.

Half the observations are below, and half are above, the median.
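
We can check these definitions against R’s built-in functions on a small toy vector:

```{r}
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x) # same as sum(x) / length(x)
sum(x) / length(x)

var(x) # squared deviations from the mean, divided by (n - 1)
sum((x - mean(x))^2) / (length(x) - 1)

sd(x) # square root of the variance
sqrt(var(x))

median(x) # the middle of the sorted observations
```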

Case Study: DocVisits

We will (again) use this superb repository of datasets created by Vincent Arel-Bundock. Let us choose a modest-sized dataset, say this dataset on Doctor Visits, which is available online here and read it into R. We will clean it, munge it, and prepare it in one shot with everything we learnt in the Inspect Data module.

Read the Data

Table 2: Doctor Visits Dataset
docVisits <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv")
Rows: 5190 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, private, freepoor, freerepat, nchronic, lchronic
dbl (7): rownames, visits, age, income, illness, reduced, health

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(docVisits)
Rows: 5,190
Columns: 13
$ rownames  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ visits    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ gender    <chr> "female", "female", "male", "male", "male", "female", "femal…
$ age       <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness   <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health    <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …
$ private   <chr> "yes", "yes", "no", "no", "no", "no", "no", "no", "yes", "ye…
$ freepoor  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ freerepat <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ nchronic  <chr> "no", "no", "no", "no", "yes", "yes", "no", "no", "no", "no"…
$ lchronic  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

So 5190 rows and 13 columns. Several variables are of class character (gender, private, freepoor, freerepat, nchronic, lchronic) and several are double (visits, age, income, illness, reduced, health).

Data Cleaning and Munging

We will first clean the data, and then modify it to make it ready for analysis.

Code
docVisits_modified <- docVisits %>%
  # Replace common NA strings and numbers with actual NA
  naniar::replace_with_na_all(condition = ~ .x %in% common_na_strings) %>%
  naniar::replace_with_na_all(condition = ~ .x %in% common_na_numbers) %>%
  # Clean variable names
  janitor::clean_names(case = "snake") %>% # clean names

  # Convert character variables to factors
  mutate(
    gender = as_factor(gender),
    private = as_factor(private),
    freepoor = as_factor(freepoor),
    freerepat = as_factor(freerepat),
    nchronic = as_factor(nchronic),
    lchronic = as_factor(lchronic)
  ) %>%
  # arrange the character variables first
  dplyr::relocate(where(is.factor), .after = rownames)


docVisits_modified %>% glimpse()
Rows: 5,190
Columns: 13
$ rownames  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ gender    <fct> female, female, male, male, male, female, female, female, fe…
$ private   <fct> yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, no, …
$ freepoor  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ freerepat <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no, no, no,…
$ nchronic  <fct> no, no, no, no, yes, yes, no, no, no, no, no, no, yes, yes, …
$ lchronic  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ visits    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ age       <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness   <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health    <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …

Final Clean Data Table

Code
docVisits_modified %>%
  DT::datatable(
    caption = htmltools::tags$caption(
      style = "caption-side: top; text-align: left; color: black; font-size: 150%;",
      "Doctor Visits Dataset (Clean)"
    ),
    options = list(pageLength = 10, autoWidth = TRUE)
  ) %>%
  DT::formatStyle(
    columns = names(docVisits_modified),
    fontFamily = "Roboto Condensed",
    fontSize = "12px"
  )
Table 3: Doctor Visits Dataset (Clean)

Data Dictionary

We can set up the Data Dictionary from the website describing the data:

Variable Description
visits Number of doctor visits in past 2 weeks.
gender Factor indicating gender.
age Age in years divided by 100.
income Annual income in tens of thousands of dollars.
illness Number of illnesses in past 2 weeks.
reduced Number of days of reduced activity in past 2 weeks due to illness or injury.
health General health questionnaire score using Goldberg’s method.
private Factor. Does the individual have private health insurance?
freepoor Factor. Does the individual have free government health insurance due to low income?
freerepat Factor. Does the individual have free government health insurance due to old age, disability or veteran status?
nchronic Factor. Is there a chronic condition not limiting activity?
lchronic Factor. Is there a chronic condition limiting activity?

Summarise the Data

We now proceed to extract summary statistics from the data. We will first work with individual variables, and then use sensible combinations to summarize with, based on our understanding of the variables involved.

We will use:

  • Overall view: skimr::skim(), mosaic::inspect(), and dplyr::glimpse()
  • Qual variables: dplyr::count()
  • Quant variables: dplyr::summarise()
  • Both together: dplyr::group_by() + dplyr::summarize(); and crosstable::crosstable()

to develop our intuitions.

Overall View of Data
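
A minimal sketch with the three overall-view functions listed above; each gives a slightly different flavour of the same bird’s-eye view:

```{r}
docVisits_modified %>% dplyr::glimpse() # rows, columns, types, first few values
docVisits_modified %>% skimr::skim() # per-variable summary statistics
docVisits_modified %>% mosaic::inspect() # Qual and Quant variables, separately
```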

Summarise Qual Variables
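
With dplyr::count() we can tally the levels of a single Qual variable, and combinations of levels across several:

```{r}
# Counts per level of one Qual variable
docVisits_modified %>% count(gender)

# Counts across combinations of levels of two Qual variables
docVisits_modified %>% count(gender, private, sort = TRUE)
```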

Summarise Quant Variables

How about summaries for Quant variables?
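
With dplyr::summarise() we can compute the stat measures we defined earlier, for one or more Quant variables at a time:

```{r}
docVisits_modified %>%
  summarise(
    mean_visits = mean(visits),
    median_visits = median(visits),
    sd_visits = sd(visits),
    min_income = min(income),
    max_income = max(income),
    count = n()
  )
```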

Grouped Summaries

Why Grouped Summaries?

We saw that we could obtain numerical summary stats such as means, medians, quartiles, and maxima/minima of entire Quantitative variables, i.e. the complete column. However, we often need identical numerical summary stats for parts of a Quantitative variable. Why?

Note that we have Qualitative variables as well in a typical dataset. These Qual variables help us to group the entire dataset based on their combinations of levels. We can now think of summarizing Quant variables within each such group. This will give us an idea whether different segments of the population, as defined by Qual variables and their levels, are relatively similar, or if there are significant differences between groups.

Creating Group Summaries
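
A minimal sketch: dplyr::group_by() + summarise() for one grouping variable, and a crosstable() call (using its tidyselect interface) for several Quant variables over a Qual variable:

```{r}
# Quant summaries within each level of a Qual variable
docVisits_modified %>%
  group_by(gender) %>%
  summarise(
    mean_visits = mean(visits),
    mean_income = mean(income),
    count = n()
  )

# The same idea with crosstable: visits and income, split by gender
crosstable(docVisits_modified,
  cols = c(visits, income),
  by = gender
)
```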

Summaries and Uncertainty

So, are we sure these summaries speak the truth? Are they accurate? Do they really represent a truth about the population from which this data sample was drawn?

We will need to deal with these ideas when we get to Inferential Statistics. For now, we will just note that the summaries we have obtained are sample statistics. We will need to perform some analysis to understand how well these sample statistics represent the corresponding population parameters.

More on dplyr

The {dplyr} package is capable of doing much more than just count, group_by and summarize. We will encounter this package many times more as we build our intuition about data visualization. A full tutorial on {dplyr} is here:

dplyr Tutorial

Your Turn

  1. Star Trek Books
  2. Math Anxiety! Hah!
  3. Cardio Data Sets
  4. Neuro Data Sets
  5. Datasets from the Lock5 Textbook (Pruim 2015; Lock 2021)

Note

Which would be the Group By variables here? And what would you summarize? With which function?

Note

```{r}
library(CardioDataSets)
data(package = "CardioDataSets") # Lists datasets in the package
```
```{r}
library(NeuroDataSets)
data(package = "NeuroDataSets") # Lists datasets in the package
```
```{r}
library(Lock5Data)
library(Lock5withR)
data(package = "Lock5Data") # Lists datasets in the package
data(package = "Lock5withR") # Lists datasets in the package
```

Wait, But Why?

  • Data Summaries give you the essentials, without getting bogged down in the details (just yet).
  • Summaries help you “live with your data”; this is an important step in understanding it, and deciding what to do with it.
  • Summaries help evoke Questions and Hypotheses about the population, which may lead to inquiries, analysis, and insights
  • Grouped Summaries should tell you if:
    • counts of groups in your target audience are lopsided/imbalanced; Go and Get your data again.
    • there are visible differences in Quant data across groups, so your target audience could be nicely fractured;
    • etc.

Conclusion

  • mosaic::inspect(), skimr::skim() and dplyr::glimpse() give us an overall summary of our data.
  • Using dplyr::count() we can get counts of levels of Qual variables, and combinations of levels of multiple Qual variables.
  • With dplyr::summarise() we can get summary statistics of Quant variables, singly or in pairs, or even all together.
  • Using dplyr::group_by() we can group the data by levels of one or more Qual variables, and then use dplyr::summarise() to get summary statistics of Quant variables within each group.
  • crosstable::crosstable() can also be used to get grouped summaries of multiple Quant variables over Qual variables, using the formula interface.

Make these part of your Workflow.

AI Generated Summary and Podcast

This is a tutorial on using the R programming language to perform descriptive statistical analysis on data sets. The tutorial focuses on summarizing data using various R packages like {dplyr} and {crosstable}. It emphasizes the importance of understanding the data’s structure, identifying different types of variables (qualitative and quantitative), and calculating summary statistics such as means, medians, and frequencies. The tutorial provides examples using real datasets and highlights the significance of data summaries in gaining initial insights, formulating research questions, and identifying potential issues with the data.

References

  1. Lock, Lock, Lock, Lock, and Lock. (2021). Statistics: Unlocking the Power of Data, 3rd Edition. https://media.wiley.com/product_data/excerpt/69/11196821/1119682169-32.pdf

R Package Citations

Package Version Citation
CardioDataSets 0.2.0 Caceres Rossi (2025a)
crosstable 0.8.2 Chaltiel (2025)
janitor 2.2.1 Firke (2024)
Lock5Data 3.0.0 Lock (2021)
Lock5withR 1.2.2 Pruim (2015)
mosaic 1.9.2 Pruim, Kaplan, and Horton (2017)
NeuroDataSets 0.2.0 Caceres Rossi (2025b)
skimr 2.2.1 Waring et al. (2025)
Caceres Rossi, Renzo. 2025a. CardioDataSets: A Comprehensive Collection of Cardiovascular and Heart Disease Datasets. https://github.com/lightbluetitan/cardiodatasets.
———. 2025b. NeuroDataSets: A Comprehensive Collection of Neuroscience and Brain-Related Datasets. https://github.com/lightbluetitan/neurodatasets.
Chaltiel, Dan. 2025. crosstable: Crosstables for Descriptive Analyses. https://doi.org/10.32614/CRAN.package.crosstable.
Firke, Sam. 2024. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://doi.org/10.32614/CRAN.package.janitor.
Lock, Robin. 2021. Lock5Data: Datasets for Statistics: UnLocking the Power of Data. https://doi.org/10.32614/CRAN.package.Lock5Data.
Pruim, Randall. 2015. Lock5withR: Datasets for Statistics: Unlocking the Power of Data. https://github.com/rpruim/Lock5withR.
Pruim, Randall, Daniel T Kaplan, and Nicholas J Horton. 2017. “The Mosaic Package: Helping Students to Think with Data Using r.” The R Journal 9 (1): 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.
Stigler, Stephen M. 2016. “The Seven Pillars of Statistical Wisdom,” March. https://doi.org/10.4159/9780674970199.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2025. skimr: Compact and Flexible Summaries of Data. https://doi.org/10.32614/CRAN.package.skimr.