The Mad Hatter’s Guide to Data Viz and Stats in R


Summaries

Throwing away data to grasp it

Qual Variables
Quant Variables
Mean
Median
Standard Deviation
Quartiles
Author

Arvind V.

Published

October 15, 2023

Modified

October 1, 2025

Abstract
Bill Gates walked into a bar, and everyone’s salary went up on average.

“Love is like quicksilver in the hand. Leave the fingers open and it stays. Clutch it, and it darts away.”

— Dorothy Parker, author (22 Aug 1893-1967)

1 Setting up R Packages

library(tidyverse)
library(mosaic) # Our all-in-one package
library(skimr) # Looking at data
library(janitor) # Clean the data
library(naniar) # Handle missing data
library(visdat) # Visualise missing data
library(tinytable) # Printing Static Tables for our data
library(DT) # Interactive Tables for our data
library(crosstable) # Multiple variable summaries

Plot Fonts and Theme

library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 8
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

2 How do we Grasp Data?

We spoke of Experiments and Data Gathering in the first module, Nature of Data, which helped us to obtain data. Then we learnt to Inspect Data to get a feel for it, and to understand what the variables meant. We also cleaned up the data and arrived at a freshly-minted dataset, ready for analysis.

However, despite this inspection, understanding, and cleaning, the actual data remains elusive for us to comprehend in its entirety. Anything more than a handful of observations in a dataset is enough for us to require other ways of grasping it.

The first thing we need to do, therefore, is to reduce it to a few salient numbers that allow us to summarize the data.

2.1 Reduction is Addition

Such a reduction may seem paradoxical but is one of the important tenets of statistics: reduction, while taking away information, ends up adding to insight.

Steven Stigler (2016) is the author of the book “The Seven Pillars of Statistical Wisdom”. One of the Big Ideas in Statistics from that book is: Aggregation

The first pillar I will call Aggregation, although it could just as well be given the nineteenth-century name, “The Combination of Observations,” or even reduced to the simplest example, taking a mean. Those simple names are misleading, in that I refer to an idea that is now old but was truly revolutionary in an earlier day—and it still is so today, whenever it reaches into a new area of application. How is it revolutionary? By stipulating that, given a number of observations, you can actually gain information by throwing information away! In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.

2.2 Throwing Away Data with Brad Pitt

Let us get some inspiration from Brad Pitt, from the movie Moneyball, which is about applying Data Analytics to the game of baseball.

2.3 Literacy in the USA

And then, an example from a more sombre story:

| Year | Below Level #1 | Level #1 | Level #2 | Level #3 | Levels #4 and #5 |
|---|---|---|---|---|---|
| Number in millions (2012/2014) | 8.35 | 26.5 | 65.1 | 71.4 | 26.6 |
| Number in millions (2017) | 7.59 | 29.2 | 66.1 | 68.8 | 26.7 |

SOURCE: U.S. Department of Education, National Center for Education Statistics, Program for the International Assessment of Adult Competencies (PIAAC), U.S. PIAAC 2017, U.S. PIAAC 2012/2014.

Table 1: US Adults Literacy and Numeracy Skills

This ghastly-looking Table 1 depicts U.S. adults with low English literacy and numeracy skills—or low-skilled adults—at two points in the 2010s, in the years 2012/2014 and 2017, using data from the Program for the International Assessment of Adult Competencies (PIAAC). As can be seen, the summary table is quite surprising in absolute terms, for a developed country like the US, and the numbers have increased from 2012/2014 to 2017!

2.4 Why Summarize?

So why do we need to summarise data? Summarization is an act of throwing away data to make more sense, as stated by (Stigler 2016) and also in the movie by Brad Pitt aka Billy Beane.

To summarize is to understand.

Add to that the fact that our working memory can hold perhaps seven items at a time, and summarizing aids information retention too.

It is also a means of registering surprise: some of our first Questions about the data arise from an inspection of data summaries.

2.5 And if we don’t summarise?

Jorge Luis Borges, in a fantasy short story published in 1942, titled “Funes the Memorious,” described a man, Ireneo Funes, who found after an accident that he could remember absolutely everything. He could reconstruct every day in the smallest detail, and he could even later reconstruct the reconstruction, but he was incapable of understanding. Borges wrote, “To think is to forget details, generalize, make abstractions. In the teeming world of Funes, there were only details.” (emphasis mine)

Aggregation can yield great gains above the individual components in data. Funes was Big Data without Summary Statistics.

3 What graphs / numbers will we see today?

| Variable #1 | Variable #2 | Chart Names | “Chart Shape” |
|---|---|---|---|
| All | All | Tables and Stat Measures | |

3.1 What are Summaries?

Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:

  • Dataset length: How many rows/observations?
  • Dataset breadth: How many columns/variables?
  • How many Quant variables?
  • How many Qual variables?
  • Quant variables: min, max, mean, median, sd
  • Qual variables: levels, counts per level
  • Both: means, medians for each level of a Qual variable…
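These first few numbers need nothing beyond base R. Here is a minimal sketch, using the built-in `iris` dataset as a stand-in for any freshly-read dataset:

```r
nrow(iris) # dataset length: 150 observations
ncol(iris) # dataset breadth: 5 variables
sum(sapply(iris, is.numeric)) # how many Quant variables? 4
sum(sapply(iris, is.factor)) # how many Qual variables? 1
range(iris$Sepal.Length) # min and max of one Quant variable
levels(iris$Species) # levels of a Qual variable
```

We will shortly see that `dplyr::glimpse()`, `skimr::skim()`, and `mosaic::inspect()` give us all of these in one shot.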

4 How do these Summaries Work?

4.1 Quant Variable Summaries

Quant variables: Inspecting the base::min, base::max, mean, median, variance and sd of each of the Quant variables tells us straightaway what the ranges of the variables are, and if there are some outliers, which could be normal, or maybe due to data entry error!

Comparing two Quant variables for their ranges also tells us that we may have to scale/normalize them for computational ease, if one variable has large numbers and the other has very small ones.
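A minimal sketch of such scaling, using base R's `scale()` on two hypothetical variables with very different ranges:

```r
x <- c(1000, 2000, 3000) # a large-valued variable
y <- c(0.1, 0.2, 0.3) # a small-valued variable
as.numeric(scale(x)) # centred to mean 0, scaled to sd 1: -1 0 1
as.numeric(scale(y)) # the same: -1 0 1, now on a common footing
```

After scaling, both variables are expressed in "standard deviations away from their own mean", so neither dominates a computation merely because of its units.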

4.2 Qual Variable Summaries

Qual variables: With Qual variables, we understand the levels within each, and understand the total number of combinations of the levels across these.

Counts across levels, and across combinations of levels, tell us whether the data has sufficient readings for graphing, inference, and decision-making, or if certain levels/classes of data are under- or over-represented.

4.3 Joint Summaries

Together?: We can use Quant and Qual variables together, to develop the above summaries (min, max, mean, median, and sd) for Quant variables, again across levels, and across combinations of levels of single or multiple Quals, along with counts.

This will tell us if our (sample) dataset already shows quantitative differences between sub-classes in the population.

4.4 Simpson’s Paradox, Missing Data, and Imputation

And this may also tell us if we are witnessing a Simpson’s Paradox situation. You may have to decide on what to do with this data sparseness, or just check your biases!
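To see how grouped summaries can flatly reverse an overall summary, here is a minimal sketch using the success counts from the well-known kidney-stone treatment study (the numbers are illustrative):

```r
## Within each stone-size group, Treatment A has the higher success rate...
small_A <- 81 / 87 # ~93%
small_B <- 234 / 270 # ~87%
large_A <- 192 / 263 # ~73%
large_B <- 55 / 80 # ~69%
## ...yet aggregated over both groups, Treatment B looks better!
overall_A <- (81 + 192) / (87 + 263) # ~78%
overall_B <- (234 + 55) / (270 + 80) # ~83%
```

The reversal happens because Treatment A was given mostly to the harder (large-stone) cases: the Qual grouping variable is confounded with the treatment. This is exactly why grouped summaries matter.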

For both types of variables, we need to keep an eye open for data entries that are missing! This may point to data gathering errors, which may be fixable. Or we will have to take a decision to let go of that entire observation (i.e. a row).

Or we might even do what is called imputation: filling in values based on the other values in the same column. This sounds like we are making up data, but, done carefully, it isn’t really so.
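A minimal sketch of the simplest form, mean imputation, on a hypothetical income column:

```r
income <- c(0.55, NA, 0.90, 0.15, NA, 0.35) # two missing entries
income_imputed <- ifelse(
  is.na(income),
  mean(income, na.rm = TRUE), # fill NAs with the mean of the observed values
  income
)
income_imputed # the NAs are now 0.4875, the observed mean
```

This preserves the column mean but shrinks its variance, so it is only a first resort; packages such as naniar (which we loaded above) and mice offer more principled approaches.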

5 Some Quick Summary Definitions

5.1 Mean

The sample mean, or average, of a Quantitative data variable can be calculated as the sum of the observed values divided by the number of observations:

\[ \large{mean = \bar{x} = \frac{x_1 + x_2+ x_3....+x_n}{n}} \]
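The formula, checked by hand against R’s built-in on a small toy vector:

```r
x <- c(2, 4, 6, 8)
sum(x) / length(x) # the formula: (2 + 4 + 6 + 8) / 4 = 5
mean(x) # R's built-in agrees: 5
```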

5.2 Variance and Standard Deviation

Observations can be on either side of the mean, naturally. To measure the extent of these differences, we square and sum the differences between individual values and their mean, and take their average to obtain the (sample) variance:

\[ \large{variance = s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n-1}} \]

The standard deviation \(s\) is just the square root of the variance. (The \(n-1\) is a mathematical nuance to allow for the fact that we have used the data to calculate the mean before we get to \(s^2\), and hence have “used up” one degree of freedom in the data. It gets us more robust results.)
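The same toy vector confirms that R’s `var()` and `sd()` use the \(n-1\) divisor:

```r
x <- c(2, 4, 6, 8)
xbar <- mean(x) # 5
sum((x - xbar)^2) / (length(x) - 1) # (9 + 1 + 1 + 9) / 3 = 6.67
var(x) # built-in sample variance agrees: 6.67
sd(x) # sqrt of the variance: ~2.58
```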

5.3 Median

When the observations in a Quant variable are placed in order of their magnitude (i.e. rank), the observation in the middle is the median.

Half the observations are below, and half are above, the median.
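A quick check of this ranking idea in R:

```r
x <- c(7, 1, 9, 3, 5)
sort(x) # 1 3 5 7 9: the middle value is 5
median(x) # 5
median(c(x, 11)) # with an even count, the two middle values are averaged: 6
```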

6 Case Study: DocVisits

We will (again) use this superb repository of datasets created by Vincent Arel-Bundock. Let us choose a modest-sized dataset, say this dataset on Doctor Visits, which is available online here and read it into R. We will clean it, munge it, and prepare it in one shot with everything we learnt in the Inspect Data module.

6.1 Read the Data

docVisits <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv")
Rows: 5190 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, private, freepoor, freerepat, nchronic, lchronic
dbl (7): rownames, visits, age, income, illness, reduced, health

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(docVisits)
Rows: 5,190
Columns: 13
$ rownames  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ visits    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ gender    <chr> "female", "female", "male", "male", "male", "female", "femal…
$ age       <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness   <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health    <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …
$ private   <chr> "yes", "yes", "no", "no", "no", "no", "no", "no", "yes", "ye…
$ freepoor  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ freerepat <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ nchronic  <chr> "no", "no", "no", "no", "yes", "yes", "no", "no", "no", "no"…
$ lchronic  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
Table 2: Doctor Visits Dataset

So 5190 rows and 13 columns. Several variables are of class character (gender, private, freepoor, freerepat, nchronic, lchronic ) and several are double (visits, age, income, illness, reduced, health).

6.2 Data Cleaning and Munging

We will first clean the data, and then modify it to make it ready for analysis.

docVisits_modified <- docVisits %>%
  # Replace common NA strings and numbers with actual NA
  naniar::replace_with_na_all(condition = ~ .x %in% common_na_strings) %>%
  naniar::replace_with_na_all(condition = ~ .x %in% common_na_numbers) %>%
  # Clean variable names
  janitor::clean_names(case = "snake") %>% # clean names

  # Convert character variables to factors
  mutate(
    gender = as_factor(gender),
    private = as_factor(private),
    freepoor = as_factor(freepoor),
    freerepat = as_factor(freerepat),
    nchronic = as_factor(nchronic),
    lchronic = as_factor(lchronic)
  ) %>%
  # arrange the character variables first
  dplyr::relocate(where(is.factor), .after = rownames)


docVisits_modified %>% glimpse()
Rows: 5,190
Columns: 13
$ rownames  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ gender    <fct> female, female, male, male, male, female, female, female, fe…
$ private   <fct> yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, no, …
$ freepoor  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ freerepat <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no, no, no,…
$ nchronic  <fct> no, no, no, no, yes, yes, no, no, no, no, no, no, yes, yes, …
$ lchronic  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ visits    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ age       <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness   <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health    <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …

6.3 Final Clean Data Table

docVisits_modified %>%
  DT::datatable(
    caption = htmltools::tags$caption(
      style = "caption-side: top; text-align: left; color: black; font-size: 150%;",
      "Doctor Visits Dataset (Clean)"
    ),
    options = list(pageLength = 10, autoWidth = TRUE)
  ) %>%
  DT::formatStyle(
    columns = names(docVisits_modified),
    fontFamily = "Roboto Condensed",
    fontSize = "12px"
  )
Table 3: Doctor Visits Dataset (Clean)

6.4 Data Dictionary

We can set up the Data Dictionary from the website describing the data:

Variable Description
visits Number of doctor visits in past 2 weeks.
gender Factor indicating gender.
age Age in years divided by 100.
income Annual income in tens of thousands of dollars.
illness Number of illnesses in past 2 weeks.
reduced Number of days of reduced activity in past 2 weeks due to illness or injury.
health General health questionnaire score using Goldberg’s method.
private Factor. Does the individual have private health insurance?
freepoor Factor. Does the individual have free government health insurance due to low income?
freerepat Factor. Does the individual have free government health insurance due to old age, disability or veteran status?
nchronic Factor. Is there a chronic condition not limiting activity?
lchronic Factor. Is there a chronic condition limiting activity?

7 Summarise the Data

We now proceed to extract summary statistics from the data. We will first work with individual variables, and then use sensible combinations to summarize with, based on our understanding of the variables involved.

We will use:

  • Overall view: skimr::skim(), mosaic::inspect(), and dplyr::glimpse()
  • Qual variables: dplyr::count()
  • Quant variables: dplyr::summarise()
  • Both together: dplyr::group_by() + dplyr::summarize(); and crosstable::crosstable()

to develop our intuitions.

8 Overall View of Data

  • dplyr::glimpse()
  • skimr::skim()
  • mosaic::inspect()
  • Business Insights
  • What should we look for?

We are familiar with dplyr::glimpse(), which gives us a quick overview of the data structure.

docVisits_modified %>% dplyr::glimpse()
Rows: 5,190
Columns: 13
$ rownames  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ gender    <fct> female, female, male, male, male, female, female, female, fe…
$ private   <fct> yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, no, …
$ freepoor  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ freerepat <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no, no, no,…
$ nchronic  <fct> no, no, no, no, yes, yes, no, no, no, no, no, no, yes, yes, …
$ lchronic  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ visits    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ age       <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness   <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health    <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …
docVisits_modified %>% skimr::skim()
Data summary
Name Piped data
Number of rows 5190
Number of columns 13
_______________________
Column type frequency:
factor 6
numeric 7
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1 FALSE 2 fem: 2702, mal: 2488
private 0 1 FALSE 2 no: 2892, yes: 2298
freepoor 0 1 FALSE 2 no: 4968, yes: 222
freerepat 0 1 FALSE 2 no: 4099, yes: 1091
nchronic 0 1 FALSE 2 no: 3098, yes: 2092
lchronic 0 1 FALSE 2 no: 4585, yes: 605

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rownames 3 1 2596.96 1497.58 1.00 1300.50 2597.00 3893.50 5190.00 ▇▇▇▇▇
visits 0 1 0.30 0.80 0.00 0.00 0.00 0.00 9.00 ▇▁▁▁▁
age 0 1 0.41 0.20 0.19 0.22 0.32 0.62 0.72 ▇▂▁▂▅
income 0 1 0.58 0.37 0.00 0.25 0.55 0.90 1.50 ▇▆▅▅▂
illness 0 1 1.43 1.38 0.00 0.00 1.00 2.00 5.00 ▇▂▂▁▁
reduced 0 1 0.86 2.89 0.00 0.00 0.00 0.00 14.00 ▇▁▁▁▁
health 0 1 1.22 2.12 0.00 0.00 0.00 2.00 12.00 ▇▁▁▁▁

In addition to these, skimr::skim() gives a neat little histogram of the Quant variables, which is very useful to get a quick idea of the distribution of values in the variables. It can also be used to detect Quant Variables that are actually Qual Variables, detectable by a very limited set of bars in their histogram.

inspect_output <- docVisits_modified %>% mosaic::inspect()
inspect_output[[1]] %>% tt()
name class levels n missing distribution
gender factor 2 5190 0 female (52.1%), male (47.9%)
private factor 2 5190 0 no (55.7%), yes (44.3%)
freepoor factor 2 5190 0 no (95.7%), yes (4.3%)
freerepat factor 2 5190 0 no (79%), yes (21%)
nchronic factor 2 5190 0 no (59.7%), yes (40.3%)
lchronic factor 2 5190 0 no (88.3%), yes (11.7%)
inspect_output[[2]] %>% tt()
name class min Q1 median Q3 max mean sd n missing
rownames numeric 1 1300 2597 3894 5190 2597 1498 5187 3
visits numeric 0 0 0 0 9 0.3 0.8 5190 0
age numeric 0.19 0.22 0.32 0.62 0.72 0.41 0.2 5190 0
income numeric 0 0.25 0.55 0.9 1.5 0.58 0.37 5190 0
illness numeric 0 0 1 2 5 1.4 1.4 5190 0
reduced numeric 0 0 0 0 14 0.86 2.9 5190 0
health numeric 0 0 0 2 12 1.2 2.1 5190 0

The other two functions give detailed summaries of the data, separately for Qual and Quant variables: mean, sd, median, and quartiles for Quant variables, and levels and counts for Qual variables. Missing data is also flagged.

  • We could think of visits as our response or target variable, and the rest as explanatory variables.
  • Mean visits are 0.3 and the distribution is very skewed to the right (from the skimr::skim() histogram). Most people do not visit the doctor in a 2-week period.
  • Among Qual variables, freepoor and lchronic have very unbalanced counts. The rest are reasonably balanced.
  • reduced shows that some people have been ill for the entire preceding 2 weeks (reduced = 14 days).
  • health has a max Goldberg score of \(12\), but the mean is a much lower \(1.22\): a few people in this dataset are really not feeling well.
  • Which are the important Quant and Qual variables for your study?
  • What are their units?
  • What are their ranges? Are these sensible, e.g. 120% for a variable that is a percentage?
  • What are the levels of the Qual variables of interest? Are they too many (e.g. 34 manufacturers in the mtcars dataset) or too few?
  • Understand the means and variances. Do they make sense?
  • Could any relevant variable be missing altogether?

9 Summarise Qual Variables

  • Gender
  • Private
  • Freepoor and Lchronic
  • All Factors
  • Business Insights
  • What should we look for?
## Counting by the obvious factor variables
docVisits_modified %>%
  dplyr::count(gender) %>%
  tt()
gender n
female 2702
male 2488
Table 4: Counts by Gender in docVisits
docVisits_modified %>%
  dplyr::count(freepoor) %>%
  tt()
freepoor n
no 4968
yes 222
Table 5: Counts of freepoor in docVisits
docVisits_modified %>%
  dplyr::count(across(.cols = c(freepoor, lchronic))) %>%
  tt()
freepoor lchronic n
no no 4389
no yes 579
yes no 196
yes yes 26
Table 6: Counts of freepoor and lchronic
docVisits %>%
  count(across(where(is.character))) %>%
  tt()
gender private freepoor freerepat nchronic lchronic n
female no no no no no 310
female no no no no yes 36
female no no no yes no 186
female no no yes no no 178
female no no yes no yes 161
female no no yes yes no 478
female no yes no no no 49
female no yes no no yes 10
female no yes no yes no 25
female yes no no no no 540
female yes no no no yes 133
female yes no no yes no 596
male no no no no no 698
male no no no no yes 75
male no no no yes no 274
male no no yes no no 68
male no no yes no yes 76
male no no yes yes no 130
male no yes no no no 92
male no yes no no yes 16
male no yes no yes no 30
male yes no no no no 558
male yes no no no yes 98
male yes no no yes no 373
Table 7: Counts of all Qual variables in docVisits
  • Most factors are balanced in count. Except for freepoor and lchronic.
  • The counts for freepoor are heavily skewed towards no, which is expected in a general population.
  • lchronic has a very small count for yes, which may be a problem if we want to study this group.
  • The proportion of chronic sufferers is low both among those who opt for freepoor visits, and for those who do not.
  • The combinations of Qual Variables are very numerous. All we can say is that the counts are very dispersed. But that may be OK, if your Question of interest does not involve those combinations.

What is the most important dialogue uttered in the movie “Sholay”?

  • Which are the important Qual variables for your study?
  • Are the counts with respect to the levels of these Qual variables nearly identical? Or is the data skewed towards certain levels?
  • Are there any levels that have very few observations?
  • Are there any levels that are missing altogether?
  • What combinations of levels are relevant for your study?
  • Are there any combinations of levels that are missing altogether?

10 Summarise Quant Variables

How about summaries for Quant variables?

  • Single Variable, Single Summary
  • Single Variable, Multiple Summaries
  • Multiple Variables, Multiple Summaries
  • Business Insights
  • What should we look for?
# Single Variable, Single Summary
docVisits %>%
  dplyr::summarise(mean_income = mean(income, na.rm = T))
# Single Variable, Multiple Summaries
docVisits_modified %>%
  dplyr::summarise(
    mean_visits = mean(visits, na.rm = T),
    sd_visits = sd(visits, na.rm = T),
    min_visits = min(visits, na.rm = T),
    max_visits = max(visits, na.rm = T)
  )
# Multiple Variables, Multiple Summaries
docVisits_modified %>%
  dplyr::summarise(across(
    .cols = c(visits, income), # select columns

    .fns = list( # use the same lambda style throughout, so na.rm applies everywhere
      mean = ~ mean(., na.rm = TRUE),
      sd = ~ sd(., na.rm = TRUE),
      min = ~ min(., na.rm = TRUE),
      max = ~ max(., na.rm = TRUE)
    )
  ))
  • Mean visits are 0.3, with a high sd of 0.8, indicating a highly skewed distribution. This is confirmed by the min of 0 and max of 9 visits in a 2-week period.
  • income has a mean of \(0.58\) and a sd of \(0.37\), with a min of \(0\) and a max of \(1.5\). This indicates a reasonable spread of income in the dataset. From the skimr::skim() output in Section 8, we see that the distribution of income is skewed, but not terribly so.
  • Do different Quant variables have very different ranges? If so, you may have to scale/normalize them for computational ease.
  • Do the means and medians differ significantly? If so, the distribution may be skewed, and you may have to use the median as a more robust measure of central tendency. And also use quartiles to summarize the spread of the data. Transformations such as log or sqrt may also help.
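The abstract’s Bill Gates joke makes this last point concrete. A minimal sketch, with hypothetical salaries:

```r
salaries <- c(3, 3.2, 3.5, 4, 4.1) # five bar patrons (hypothetical units)
mean(salaries) # 3.56
median(salaries) # 3.5: close to the mean, so fairly symmetric
salaries_bg <- c(salaries, 10000) # Bill Gates walks into the bar
mean(salaries_bg) # leaps to ~1670: everyone's salary "went up on average"
median(salaries_bg) # 3.75: barely moves, hence robust to outliers and skew
```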

11 Grouped Summaries

11.1 Why Grouped Summaries?

We saw that we could obtain numerical summary stats such as means, medians, quartiles, and maximum/minimum of entire Quantitative variables, i.e. the complete column. However, we often need identical numerical summary stats for parts of a Quantitative variable. Why?

Note that we have Qualitative variables as well in a typical dataset. These Qual variables help us to group the entire dataset based on their combinations of levels. We can now think of summarizing Quant variables within each such group. This will give us an idea whether different segments of the population, as defined by Qual variables and their levels, are relatively similar, or if there are significant differences between groups.

11.2 Creating Group Summaries

  • dplyr::group_by() and dplyr::summarize()
  • Intro to crosstable
  • Using crosstable
  • Business Insights from Grouped Quant Summaries (docVisits)
  • What should we look for?

We can use dplyr::group_by() to make groups in the data, and then use dplyr::summarize() to get the summaries we need.

docVisits_modified %>%
  group_by(gender) %>%
  summarize(average_visits = mean(visits), count = n())
##
docVisits_modified %>%
  group_by(freepoor, nchronic) %>%
  summarise(
    mean_income = mean(income),
    average_visits = mean(visits),
    count = n()
  )

The package crosstable allows us to rapidly summarize multiple variables grouped and split by other variables, and presents the results in an elegant form. It also conveniently uses the formula interface that makes the code very crisp, and which we will be encountering with other important packages too. We will find occasion to meet crosstable again when we do Inference.

# library(crosstable)
crosstable(visits + income ~ gender + freepoor,
  data = docVisits_modified
) %>%
  crosstable::as_flextable()

Each column below is one combination of freepoor and gender:

| Variable | Statistic | freepoor = no, female | freepoor = no, male | freepoor = yes, female | freepoor = yes, male |
|---|---|---|---|---|---|
| visits | Min / Max | 0 / 8.0 | 0 / 9.0 | 0 / 5.0 | 0 / 7.0 |
| visits | Med [IQR] | 0 [0;0] | 0 [0;0] | 0 [0;0] | 0 [0;0] |
| visits | Mean (std) | 0.4 (0.9) | 0.2 (0.7) | 0.2 (0.8) | 0.1 (0.6) |
| visits | N (NA) | 2618 (0) | 2350 (0) | 84 (0) | 138 (0) |
| income | Min / Max | 0 / 1.5 | 0 / 1.5 | 0 / 1.1 | 0 / 1.1 |
| income | Med [IQR] | 0.3 [0.2;0.7] | 0.7 [0.3;0.9] | 0.2 [0.1;0.3] | 0.2 [0.1;0.5] |
| income | Mean (std) | 0.5 (0.3) | 0.7 (0.4) | 0.2 (0.2) | 0.3 (0.2) |
| income | N (NA) | 2618 (0) | 2350 (0) | 84 (0) | 138 (0) |

Table 8: Crosstable Summary of visits and income over gender and freepoor

(The as_flextable() command from the crosstable package rendered the elegant HTML table we see. It should be possible to render to Word/PDF as well, which we might see later.)

  • Average visits for female patients seem to be higher
  • Average visits for freepoor patients seem to be lower (and their mean income is lower, of course)
  • Income for female patients seems to be lower
  • Median visits are \(0\) !! Clearly, most people do not visit the doctor in a 2-week period.
  • Clearly the people who are freepoor (on Govt Insurance) AND have a chronic condition are those with lower average income and a higher average number of visits to the doctor…but there are relatively few of them (n = 55) in this dataset…
  • Are there differences in means of Quant variables across levels of single or multiple Qual variables? This could be a first look at whether the population is fractured into sub-groups, and could be a point of research interest, e.g., disparity in interest in a product across groups.
  • Does the sd differ significantly across groups? This could indicate that some groups are more heterogeneous than others, and may need to be studied further.
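As a minimal, self-contained illustration of these two checks, again on the built-in mtcars dataset (with am, the transmission type, standing in for a Qual variable):

```{r}
# Compare group-wise means AND sds: a big difference in sd across
# groups suggests some groups are more heterogeneous than others
library(dplyr)

mtcars %>%
  group_by(am) %>%           # 0 = automatic, 1 = manual
  summarize(
    mean_mpg = mean(mpg),    # do the group means differ?
    sd_mpg   = sd(mpg),      # do the group spreads differ?
    count    = n()
  )
```

Here the manual-transmission group has both a higher mean mpg and a larger sd, so it is the more heterogeneous of the two groups.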

12 Summaries and Uncertainty

So, are we sure these summaries speak the truth? Are they accurate? Do they really represent a truth about the population from which this data sample was drawn?

We will need to deal with these ideas when we get to Inferential Statistics. For now, we will just note that the summaries we have obtained are sample statistics. We will need to perform some analysis to understand how well these sample statistics represent the corresponding population parameters.
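To preview that idea with nothing beyond base R: if we treat mtcars as a small "population" and draw repeated samples from it, the sample mean moves around from sample to sample. A summary statistic computed from one sample is only an estimate.

```{r}
# Sample statistics vary from sample to sample: draw repeated samples
# of 10 cars from mtcars and watch the sample mean of mpg wander
set.seed(42)
sample_means <- replicate(
  1000,
  mean(sample(mtcars$mpg, size = 10))
)

mean(mtcars$mpg)      # the "population" mean for this dataset
summary(sample_means) # sample means scatter around it
sd(sample_means)      # a first glimpse of the standard error idea
```

The spread of sample_means around the overall mean is exactly the uncertainty that Inferential Statistics will quantify.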

13 More on dplyr

The dplyr package is capable of doing much more than just count, group_by and summarize. We will encounter this package many times more as we build our intuition about data visualization. A full tutorial on dplyr is here:

dplyr Tutorial

14 Your Turn

  1. Star Trek Books
  2. Math Anxiety! Hah!
  3. Cardio Data Sets
  4. Neuro Data Sets
  5. Datasets from the Lock5 Textbook (Pruim (2015), Lock (2021))
Note

Which would be the Group By variables here? And what would you summarize? With which function?

Note
```{r}
library(CardioDataSets)
data(package = "CardioDataSets") # Lists datasets in the package
```
```{r}
library(NeuroDataSets)
data(package = "NeuroDataSets") # Lists datasets in the package
```
```{r}
library(Lock5Data)
library(Lock5withR)
data(package = "Lock5Data") # Lists datasets in the package
data(package = "Lock5withR") # Lists datasets in the package
```

15 Wait, But Why?

  • Data Summaries give you the essentials, without getting bogged down in the details (just yet).
  • Summaries help you “live with your data”; this is an important step in understanding it, and deciding what to do with it.
  • Summaries help evoke Questions and Hypotheses about the population, which may lead to inquiries, analysis, and insights
  • Grouped Summaries should tell you if:
    • counts of groups in your target audience are lopsided/imbalanced; Go and Get your data again.
    • there are visible differences in Quant data across groups, so your target audience could be nicely fractured;
    • etc.

16 Conclusion

  • mosaic::inspect(), skimr::skim() and dplyr::glimpse() give us an overall summary of our data.
  • Using dplyr::count() we can get counts of levels of Qual variables, and combinations of levels of multiple Qual variables.
  • With dplyr::summarise() we can get summary statistics of Quant variables, singly or in pairs, or even all together.
  • Using dplyr::group_by() we can group the data by levels of one or more Qual variables, and then use dplyr::summarise() to get summary statistics of Quant variables within each group.
  • crosstable::crosstable() can also be used to get grouped summaries of multiple Quant variables over Qual variables, using the formula interface.

Make these part of your Workflow.
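As a compact, self-contained recap of that workflow on the built-in mtcars dataset:

```{r}
# The whole workflow in miniature: inspect, count, then grouped summaries
library(dplyr)

glimpse(mtcars)              # overall summary of the data

mtcars %>%
  count(cyl, gear)           # counts of combinations of Qual-like variables

mtcars %>%                   # grouped summaries of a Quant variable
  group_by(cyl) %>%
  summarise(
    mean_mpg = mean(mpg),
    sd_mpg   = sd(mpg),
    n        = n()
  )
```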

17 AI Generated Summary and Podcast

This is a tutorial on using the R programming language to perform descriptive statistical analysis on data sets. The tutorial focuses on summarizing data using various R packages like dplyr and crosstable. It emphasizes the importance of understanding the data’s structure, identifying different types of variables (qualitative and quantitative), and calculating summary statistics such as means, medians, and frequencies. The tutorial provides examples using real datasets and highlights the significance of data summaries in gaining initial insights, formulating research questions, and identifying potential issues with the data.


18 References

  1. Lock, Lock, Lock, Lock, and Lock. (2021). Statistics: Unlocking the Power of Data, 3rd Edition. https://media.wiley.com/product_data/excerpt/69/11196821/1119682169-32.pdf

R Package Citations

Package Version Citation
CardioDataSets 0.2.0 Caceres Rossi (2025a)
crosstable 0.8.2 Chaltiel (2025)
janitor 2.2.1 Firke (2024)
Lock5Data 3.0.0 Lock (2021)
Lock5withR 1.2.2 Pruim (2015)
mosaic 1.9.2 Pruim, Kaplan, and Horton (2017)
NeuroDataSets 0.2.0 Caceres Rossi (2025b)
skimr 2.2.1 Waring et al. (2025)
Caceres Rossi, Renzo. 2025a. CardioDataSets: A Comprehensive Collection of Cardiovascular and Heart Disease Datasets. https://github.com/lightbluetitan/cardiodatasets.
———. 2025b. NeuroDataSets: A Comprehensive Collection of Neuroscience and Brain-Related Datasets. https://github.com/lightbluetitan/neurodatasets.
Chaltiel, Dan. 2025. crosstable: Crosstables for Descriptive Analyses. https://doi.org/10.32614/CRAN.package.crosstable.
Firke, Sam. 2024. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://doi.org/10.32614/CRAN.package.janitor.
Lock, Robin. 2021. Lock5Data: Datasets for “Statistics: UnLocking the Power of Data”. https://doi.org/10.32614/CRAN.package.Lock5Data.
Pruim, Randall. 2015. Lock5withR: Datasets for “Statistics: Unlocking the Power of Data”. https://github.com/rpruim/Lock5withR.
Pruim, Randall, Daniel T Kaplan, and Nicholas J Horton. 2017. “The Mosaic Package: Helping Students to ‘Think with Data’ Using r.” The R Journal 9 (1): 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.
Stigler, Stephen M. 2016. “The Seven Pillars of Statistical Wisdom,” March. https://doi.org/10.4159/9780674970199.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2025. skimr: Compact and Flexible Summaries of Data. https://doi.org/10.32614/CRAN.package.skimr.

Citation

BibTeX citation:
@online{v.2023,
  author = {V., Arvind},
  title = {Summaries},
  date = {2023-10-15},
  url = {https://madhatterguide.netlify.app/content/courses/Analytics/10-Descriptive/Modules/10-FavStats/},
  langid = {en},
  abstract = {Bill Gates walked into a bar, and everyone’s salary went
    up on average.}
}
For attribution, please cite this work as:
V., Arvind. 2023. “Summaries.” October 15, 2023. https://madhatterguide.netlify.app/content/courses/Analytics/10-Descriptive/Modules/10-FavStats/.

License: CC BY-SA 2.0

Website made with ❤️ and Quarto, by Arvind V.

Hosted by Netlify.