Summaries

Throwing away data to grasp it

“Love is like quicksilver in the hand. Leave the fingers open and it stays. Clutch it, and it darts away.”
— Dorothy Parker, author (22 Aug 1893-1967)

1 Setting up R Packages

```{r}
library(tidyverse)  # Data wrangling and plotting
library(mosaic)     # Our all-in-one package
library(skimr)      # Looking at data
library(janitor)    # Cleaning the data
library(naniar)     # Handling missing data
library(visdat)     # Visualising missing data
library(tinytable)  # Printing static tables for our data
library(DT)         # Interactive tables for our data
library(crosstable) # Multiple-variable summaries
```
Plot Fonts and Theme
```{r}
library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)
sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +
    theme_sub_axis(
      title = element_text(family = "Roboto Condensed", size = 8),
      text = element_text(family = "Roboto Condensed", size = 6)
    ) +
    theme_sub_legend(
      text = element_text(family = "Roboto Condensed", size = 6),
      title = element_text(family = "Alegreya", size = 8)
    ) +
    theme_sub_plot(
      title = element_text(family = "Alegreya", size = 14, face = "bold"),
      title.position = "plot",
      subtitle = element_text(family = "Alegreya", size = 10),
      caption = element_text(family = "Alegreya", size = 6),
      caption.position = "plot"
    )
}
## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"
))
## Set the theme
ggplot2::theme_set(new = theme_custom())
## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)
## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")
```
2 How do we Grasp Data?
We spoke of Experiments and Data Gathering in the first module, Nature of Data. This helped us to obtain data. Then we learnt to Inspect Data to get a feel for the data, and to understand what the variables meant. We also cleaned up the data and arrived at a freshly-minted dataset, ready for analysis.
However, despite this inspection, understanding, and cleaning, the actual data remains elusive for us to comprehend in its entirety. Anything more than a handful of observations in a dataset is enough for us to require other ways of grasping it.
The first thing we need to do, therefore, is to reduce it to a few salient numbers that allow us to summarize the data.
2.1 Reduction is Addition
Such a reduction may seem paradoxical but is one of the important tenets of statistics: reduction, while taking away information, ends up adding to insight.
Stephen Stigler (2016) is the author of the book “The Seven Pillars of Statistical Wisdom”. One of the Big Ideas in Statistics from that book is Aggregation:
The first pillar I will call Aggregation, although it could just as well be given the nineteenth-century name, “The Combination of Observations,” or even reduced to the simplest example, taking a mean. Those simple names are misleading, in that I refer to an idea that is now old but was truly revolutionary in an earlier day—and it still is so today, whenever it reaches into a new area of application. How is it revolutionary? By stipulating that, given a number of observations, you can actually gain information by throwing information away! In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.
2.2 Throwing Away Data with Brad Pitt
Let us get some inspiration from Brad Pitt in the movie Moneyball, which is about applying Data Analytics to the game of baseball.
2.3 Literacy in the USA
And then, an example from a more sombre story:
Year | Below Level #1 | Level #1 | Level #2 | Level #3 | Levels #4 and #5 |
---|---|---|---|---|---|
Number in millions (2012/2014) | 8.35 | 26.5 | 65.1 | 71.4 | 26.6 |
Number in millions (2017) | 7.59 | 29.2 | 66.1 | 68.8 | 26.7 |

SOURCE: U.S. Department of Education, National Center for Education Statistics, Program for the International Assessment of Adult Competencies (PIAAC), U.S. PIAAC 2017, U.S. PIAAC 2012/2014.
This ghastly-looking Table 1 depicts U.S. adults with low English literacy and numeracy skills (“low-skilled adults”) at two points in the 2010s, in the years 2012/2014 and 2017, using data from the Program for the International Assessment of Adult Competencies (PIAAC). The numbers are quite surprising in absolute terms for a developed country like the US, and the count of adults at Level #1 actually increased from 2012/2014 to 2017!
2.4 Why Summarize?
So why do we need to summarise data? Summarization is an act of throwing away data to make more sense of it, as stated by (Stigler 2016), and also in the movie by Brad Pitt aka Billy Beane.
To summarize is to understand.
Add to that the fact that our Working Memories can hold perhaps 7 items, so summarization aids information retention too.
It is also a means of registering surprise: some of our first Questions about the data arise from an inspection of data summaries.
2.5 And if we don’t summarise?
Jorge Luis Borges, in a fantasy short story published in 1942, titled “Funes the Memorious,” described a man, Ireneo Funes, who found after an accident that he could remember absolutely everything. He could reconstruct every day in the smallest detail, and he could even later reconstruct the reconstruction, but he was incapable of understanding. Borges wrote, “To think is to forget details, generalize, make abstractions. In the teeming world of Funes, there were only details.” (emphasis mine)
Aggregation can yield great gains above the individual components in data. Funes was Big Data without Summary Statistics.
3 What graphs / numbers will we see today?
Variable #1 | Variable #2 | Chart Names | “Chart Shape” |
---|---|---|---|
All | All | Tables and Stat Measures | |
3.1 What are Summaries?
Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:
- Dataset length: How many rows/observations?
- Dataset breadth: How many columns/variables?
- How many Quant variables?
- How many Qual variables?
- Quant variables: min, max, mean, median, sd
- Qual variables: levels, counts per level
- Both: means, medians for each level of a Qual variable…
4 How do these Summaries Work?
4.1 Quant Variable Summaries
Quant variables: Inspecting the `base::min`, `base::max`, `mean`, `median`, `variance`, and `sd` of each of the Quant variables tells us straightaway what the ranges of the variables are, and whether there are outliers, which could be genuine, or perhaps due to data-entry error!
Comparing two Quant variables for their ranges also tells us that we may have to scale/normalize them for computational ease, if one variable has large numbers and the other has very small ones.
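If two variables do have wildly different ranges, standardizing is straightforward. Below is a minimal sketch using base R's `scale()`, on made-up vectors (`big` and `small` are hypothetical, not from any dataset in this module):

```{r}
# Two toy variables on very different scales (hypothetical values)
big <- c(1000, 2000, 3000, 4000)
small <- c(0.01, 0.02, 0.03, 0.04)

# scale() standardizes each value to (x - mean(x)) / sd(x),
# so both variables end up with mean 0 and sd 1
scale(big)
scale(small)
```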
4.2 Qual Variable Summaries
Qual variables: With Qual variables, we understand the `levels` within each, and the total number of combinations of the levels across these.
`Counts` across levels, and across combinations of levels, tell us whether the data has sufficient readings for graphing, inference, and decision-making, or if certain levels/classes of data are under- or over-represented.
4.3 Joint Summaries
Together?: We can use Quant and Qual variables together to develop the above summaries (`min`, `max`, `mean`, `median`, and `sd`) for Quant variables, again across levels, and across combinations of levels, of single or multiple Quals, along with `counts`.
This will tell us if our (sample) dataset already shows quantitative differences between sub-classes in the population.
4.4 Simpson’s Paradox, Missing Data, and Imputation
And this may also tell us if we are witnessing a Simpson’s Paradox situation. You may have to decide what to do about such data sparseness, or just check your biases!
For both types of variables, we need to keep an eye open for data entries that are missing! Missing entries may point to data-gathering errors, which may be fixable; or we may have to let go of that entire observation (i.e. a row).
Or we can even do what is called imputation: filling in values based on the other values in the same column. This sounds like we are making up data, but it isn’t so, really.
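Here is a minimal sketch of the simplest form of imputation (mean imputation) using `tidyr::replace_na()`; the vector `x` is made up for illustration, and real imputation methods are usually more sophisticated:

```{r}
# A toy vector with one missing entry (hypothetical data)
x <- c(2.1, NA, 3.5, 4.0)

# Fill the NA using the mean of the observed values in the same column
x_imputed <- tidyr::replace_na(x, replace = mean(x, na.rm = TRUE))
x_imputed # 2.1, 3.2, 3.5, 4.0
```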
5 Some Quick Summary Definitions
5.1 Mean
The sample mean, or average, of a Quantitative data variable can be calculated as the sum of the observed values divided by the number of observations:
\[ \large{mean = \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}} \]
5.2 Variance and Standard Deviation
Observations can be on either side of the mean, naturally. To measure the extent of these differences, we square and sum the differences between individual values and their mean, and take their (near-)average to obtain the (sample) `variance`:

\[ \large{variance = s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}} \]

The standard deviation \(s\) is just the square root of the variance. (The \(n-1\) is a mathematical nuance to allow for the fact that we have used the data to calculate the mean before we get to \(s^2\), and hence have “used up” one degree of freedom in the data. It gets us more robust results.)
5.3 Median
When the observations in a Quant variable are placed in order of their magnitude (i.e. rank), the observation in the middle is the `median`.
Half the observations are below, and half are above, the `median`.
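We can verify all three definitions against base R’s built-in functions, using a tiny made-up vector:

```{r}
# Toy data (hypothetical): n = 5 observations
x <- c(2, 4, 4, 5, 10)

mean(x)   # (2 + 4 + 4 + 5 + 10) / 5 = 5
var(x)    # sum of squared deviations / (n - 1) = 36 / 4 = 9
sd(x)     # square root of the variance = 3
median(x) # middle value of the sorted data = 4
```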
6 Case Study: DocVisits
We will (again) use this superb repository of datasets created by Vincent Arel-Bundock. Let us choose a modest-sized dataset, say this dataset on Doctor Visits, which is available online here, and read it into R. We will clean it, munge it, and prepare it in one shot with everything we learnt in the Inspect Data module.
6.1 Read the Data
```{r}
docVisits <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv")
```
Rows: 5190 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, private, freepoor, freerepat, nchronic, lchronic
dbl (7): rownames, visits, age, income, illness, reduced, health
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```{r}
glimpse(docVisits)
```
Rows: 5,190
Columns: 13
$ rownames <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ visits <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ gender <chr> "female", "female", "male", "male", "male", "female", "femal…
$ age <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …
$ private <chr> "yes", "yes", "no", "no", "no", "no", "no", "no", "yes", "ye…
$ freepoor <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ freerepat <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ nchronic <chr> "no", "no", "no", "no", "yes", "yes", "no", "no", "no", "no"…
$ lchronic <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
So: 5190 rows and 13 columns. Several variables are of class `character` (`gender, private, freepoor, freerepat, nchronic, lchronic`) and several are `double` (`visits, age, income, illness, reduced, health`).
6.2 Data Cleaning and Munging
We will first clean the data, and then modify it to make it ready for analysis.
```{r}
docVisits_modified <- docVisits %>%
  # Replace common NA strings and numbers with actual NA
  naniar::replace_with_na_all(condition = ~ .x %in% common_na_strings) %>%
  naniar::replace_with_na_all(condition = ~ .x %in% common_na_numbers) %>%
  # Clean variable names
  janitor::clean_names(case = "snake") %>%
  # Convert character variables to factors
  mutate(
    gender = as_factor(gender),
    private = as_factor(private),
    freepoor = as_factor(freepoor),
    freerepat = as_factor(freerepat),
    nchronic = as_factor(nchronic),
    lchronic = as_factor(lchronic)
  ) %>%
  # Place the (new) factor variables first, right after rownames
  dplyr::relocate(where(is.factor), .after = rownames)

docVisits_modified %>% glimpse()
```
Rows: 5,190
Columns: 13
$ rownames <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ gender <fct> female, female, male, male, male, female, female, female, fe…
$ private <fct> yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, no, …
$ freepoor <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ freerepat <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no, no, no,…
$ nchronic <fct> no, no, no, no, yes, yes, no, no, no, no, no, no, yes, yes, …
$ lchronic <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ visits <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ age <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …
6.3 Final Clean Data Table
```{r}
docVisits_modified %>%
  DT::datatable(
    caption = htmltools::tags$caption(
      style = "caption-side: top; text-align: left; color: black; font-size: 150%;",
      "Doctor Visits Dataset (Clean)"
    ),
    options = list(pageLength = 10, autoWidth = TRUE)
  ) %>%
  DT::formatStyle(
    columns = names(docVisits_modified),
    fontFamily = "Roboto Condensed",
    fontSize = "12px"
  )
```
6.4 Data Dictionary
We can set up the Data Dictionary from the website describing the data:
Variable | Description |
---|---|
visits | Number of doctor visits in past 2 weeks. |
gender | Factor indicating gender. |
age | Age in years divided by 100. |
income | Annual income in tens of thousands of dollars. |
illness | Number of illnesses in past 2 weeks. |
reduced | Number of days of reduced activity in past 2 weeks due to illness or injury. |
health | General health questionnaire score using Goldberg’s method. |
private | Factor. Does the individual have private health insurance? |
freepoor | Factor. Does the individual have free government health insurance due to low income? |
freerepat | Factor. Does the individual have free government health insurance due to old age, disability or veteran status? |
nchronic | Factor. Is there a chronic condition not limiting activity? |
lchronic | Factor. Is there a chronic condition limiting activity? |
7 Summarise the Data
We now proceed to extract summary statistics from the data. We will first work with individual variables, and then use sensible combinations to summarize with, based on our understanding of the variables involved.
We will use:
- Overall view: `skimr::skim()`, `mosaic::inspect()`, and `dplyr::glimpse()`
- Qual variables: `dplyr::count()`
- Quant variables: `dplyr::summarise()`
- Both together: `dplyr::group_by()` + `dplyr::summarize()`; and `crosstable::crosstable()`

to develop our intuitions.
8 Overall View of Data
We are familiar with `dplyr::glimpse()`, which gives us a quick overview of the data structure.
Rows: 5,190
Columns: 13
$ rownames <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ gender <fct> female, female, male, male, male, female, female, female, fe…
$ private <fct> yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, no, …
$ freepoor <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ freerepat <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no, no, no,…
$ nchronic <fct> no, no, no, no, yes, yes, no, no, no, no, no, no, yes, yes, …
$ lchronic <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
$ visits <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ age <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …
Name | Piped data |
---|---|
Number of rows | 5190 |
Number of columns | 13 |
Column type frequency: factor | 6 |
Column type frequency: numeric | 7 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
gender | 0 | 1 | FALSE | 2 | fem: 2702, mal: 2488 |
private | 0 | 1 | FALSE | 2 | no: 2892, yes: 2298 |
freepoor | 0 | 1 | FALSE | 2 | no: 4968, yes: 222 |
freerepat | 0 | 1 | FALSE | 2 | no: 4099, yes: 1091 |
nchronic | 0 | 1 | FALSE | 2 | no: 3098, yes: 2092 |
lchronic | 0 | 1 | FALSE | 2 | no: 4585, yes: 605 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
rownames | 3 | 1 | 2596.96 | 1497.58 | 1.00 | 1300.50 | 2597.00 | 3893.50 | 5190.00 | ▇▇▇▇▇ |
visits | 0 | 1 | 0.30 | 0.80 | 0.00 | 0.00 | 0.00 | 0.00 | 9.00 | ▇▁▁▁▁ |
age | 0 | 1 | 0.41 | 0.20 | 0.19 | 0.22 | 0.32 | 0.62 | 0.72 | ▇▂▁▂▅ |
income | 0 | 1 | 0.58 | 0.37 | 0.00 | 0.25 | 0.55 | 0.90 | 1.50 | ▇▆▅▅▂ |
illness | 0 | 1 | 1.43 | 1.38 | 0.00 | 0.00 | 1.00 | 2.00 | 5.00 | ▇▂▂▁▁ |
reduced | 0 | 1 | 0.86 | 2.89 | 0.00 | 0.00 | 0.00 | 0.00 | 14.00 | ▇▁▁▁▁ |
health | 0 | 1 | 1.22 | 2.12 | 0.00 | 0.00 | 0.00 | 2.00 | 12.00 | ▇▁▁▁▁ |
In addition to these, `skimr::skim()` gives a neat little histogram of each Quant variable, which is very useful for a quick idea of the distribution of its values. It can also be used to detect Quant variables that are actually Qual variables, detectable by a very limited set of bars in their histogram.
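A quick way to hunt for such variables is to count the distinct values in each numeric column; a Quant variable with only a handful of distinct values may really be a Qual variable in disguise. A small sketch (this pipeline is mine, not part of the original code):

```{r}
# Number of distinct values in each numeric column
docVisits_modified %>%
  summarise(across(where(is.numeric), n_distinct))
```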
name | class | levels | n | missing | distribution |
---|---|---|---|---|---|
gender | factor | 2 | 5190 | 0 | female (52.1%), male (47.9%) |
private | factor | 2 | 5190 | 0 | no (55.7%), yes (44.3%) |
freepoor | factor | 2 | 5190 | 0 | no (95.7%), yes (4.3%) |
freerepat | factor | 2 | 5190 | 0 | no (79%), yes (21%) |
nchronic | factor | 2 | 5190 | 0 | no (59.7%), yes (40.3%) |
lchronic | factor | 2 | 5190 | 0 | no (88.3%), yes (11.7%) |
name | class | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|---|
rownames | numeric | 1 | 1300 | 2597 | 3894 | 5190 | 2597 | 1498 | 5187 | 3 |
visits | numeric | 0 | 0 | 0 | 0 | 9 | 0.3 | 0.8 | 5190 | 0 |
age | numeric | 0.19 | 0.22 | 0.32 | 0.62 | 0.72 | 0.41 | 0.2 | 5190 | 0 |
income | numeric | 0 | 0.25 | 0.55 | 0.9 | 1.5 | 0.58 | 0.37 | 5190 | 0 |
illness | numeric | 0 | 0 | 1 | 2 | 5 | 1.4 | 1.4 | 5190 | 0 |
reduced | numeric | 0 | 0 | 0 | 0 | 14 | 0.86 | 2.9 | 5190 | 0 |
health | numeric | 0 | 0 | 0 | 2 | 12 | 1.2 | 2.1 | 5190 | 0 |
The other two functions give detailed summaries of the data, separately for Qual and Quant variables: `mean`, `sd`, `median`, and `quartiles` for Quant variables, and `levels` and `counts` for Qual variables. `missing` data is also flagged.
- We could think of `visits` as our response or target variable, and the rest as explanatory variables.
- Mean `visits` is 0.3 and the distribution is very skewed to the right (from the `skimr::skim()` histogram). Most people do not visit the doctor in a 2-week period.
- Among Qual variables, `freepoor` and `lchronic` have very unbalanced counts. The rest are reasonably balanced.
- `reduced` shows that some people have been ill for the entire preceding 2 weeks (`reduced` = 14 days).
- `health` has a max Goldberg Score of \(12\), but the mean is a much lower \(1.22\), indicating that most people in this dataset are in reasonably good health.
- Which are the important Quant and Qual variables for your study?
- What are their units?
- What are their ranges? Are these sensible, e.g. 120% for a variable that is a percentage?
- What are the levels of the Qual variables of interest? Are there too many (e.g. the 32 car models in the `mtcars` dataset)? Or too few?
- Understand the means and variances. Do they make sense?
- Could any relevant variable be missing altogether?
9 Summarise Qual Variables
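The counts table below was presumably produced with `dplyr::count()` (the code is folded on the original page); a sketch along these lines should reproduce it:

```{r}
# Counts for every observed combination of levels of the Qual variables
docVisits_modified %>%
  count(gender, private, freepoor, freerepat, nchronic, lchronic)
```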
gender | private | freepoor | freerepat | nchronic | lchronic | n |
---|---|---|---|---|---|---|
female | no | no | no | no | no | 310 |
female | no | no | no | no | yes | 36 |
female | no | no | no | yes | no | 186 |
female | no | no | yes | no | no | 178 |
female | no | no | yes | no | yes | 161 |
female | no | no | yes | yes | no | 478 |
female | no | yes | no | no | no | 49 |
female | no | yes | no | no | yes | 10 |
female | no | yes | no | yes | no | 25 |
female | yes | no | no | no | no | 540 |
female | yes | no | no | no | yes | 133 |
female | yes | no | no | yes | no | 596 |
male | no | no | no | no | no | 698 |
male | no | no | no | no | yes | 75 |
male | no | no | no | yes | no | 274 |
male | no | no | yes | no | no | 68 |
male | no | no | yes | no | yes | 76 |
male | no | no | yes | yes | no | 130 |
male | no | yes | no | no | no | 92 |
male | no | yes | no | no | yes | 16 |
male | no | yes | no | yes | no | 30 |
male | yes | no | no | no | no | 558 |
male | yes | no | no | no | yes | 98 |
male | yes | no | no | yes | no | 373 |
- Most factors are balanced in count, except for `freepoor` and `lchronic`.
- The counts for `freepoor` are heavily skewed towards `no`, which is expected in a general population.
- `lchronic` has a very small count for `yes`, which may be a problem if we want to study this group.
- The proportion of chronic sufferers is low both among those who have free government insurance (`freepoor`) and among those who do not.
- The combinations of Qual variables are very numerous. All we can say is that the counts are very dispersed. But that may be OK, if your Question of interest does not involve those combinations.
What is the most important dialogue uttered in the movie “Sholay”?
- Which are the important Qual variables for your study?
- Are the counts with respect to the levels of these Qual variables nearly identical? Or is the data skewed towards certain levels?
- Are there any levels that have very few observations?
- Are there any levels that are missing altogether?
- What combinations of levels are relevant for your study?
- Are there any combinations of levels that are missing altogether?
10 Summarise Quant Variables
How about summaries for Quant variables?
- Mean `visits` is 0.3, with a high `sd` of 0.8, indicating a highly skewed distribution. This is confirmed by the `min` of 0 and `max` of 9 visits in a 2-week period.
- `income` has a mean of \(0.58\) and an `sd` of \(0.37\), with a `min` of \(0\) and a `max` of \(1.5\). This indicates a reasonable spread of income in the dataset. From the `skimr::skim()` output in Section 8, we see that the distribution of `income` is skewed, but not terribly so.
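These whole-column statistics can be computed directly with `dplyr::summarise()`; a minimal sketch (the output column names are my own choices):

```{r}
# Whole-column summary statistics for two Quant variables
docVisits_modified %>%
  summarise(
    mean_visits   = mean(visits),
    sd_visits     = sd(visits),
    mean_income   = mean(income),
    sd_income     = sd(income),
    median_income = median(income)
  )
```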
- Do different Quant variables have very different ranges? If so, you may have to scale/normalize them for computational ease.
- Do the means and medians differ significantly? If so, the distribution may be skewed, and you may have to use the `median` as a more robust measure of central tendency, and `quartiles` to summarize the spread of the data. Transformations such as `log` or `sqrt` may also help.
11 Grouped Summaries
11.1 Why Grouped Summaries?
We saw that we could obtain numerical summary stats such as `means, medians, quartiles, maximum/minimum` of entire Quantitative variables, i.e. the complete column. However, we often need identical numerical summary stats for parts of a Quantitative variable. Why?
Note that we have Qualitative variables as well in a typical dataset. These Qual variables help us to group the entire dataset based on their combinations of levels. We can now think of summarizing Quant variables within each such group. This will give us an idea whether different segments of the population, as defined by Qual variables and their levels, are relatively similar, or if there are significant differences between groups.
11.2 Creating Group Summaries
We can use `dplyr::group_by()` to make groups in the data, and then use `dplyr::summarize()` to get the summaries we need.
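A minimal sketch of this `group_by()` + `summarise()` pattern on our dataset (the grouping variables and output column names are my choices):

```{r}
# Grouped summaries: mean visits and income, with counts, per group
docVisits_modified %>%
  group_by(gender, freepoor) %>%
  summarise(
    mean_visits = mean(visits),
    mean_income = mean(income),
    count       = n(),
    .groups     = "drop" # ungroup the result after summarising
  )
```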
The package crosstable allows us to rapidly summarize multiple variables grouped and split by other variables, and presents the results in an elegant form. It also conveniently uses the formula interface that makes the code very crisp, and which we will be encountering with other important packages too. We will find occasion to meet crosstable again when we do Inference.
```{r}
crosstable(visits + income ~ gender + freepoor,
  data = docVisits_modified
) %>%
  crosstable::as_flextable()
```
Variable | Statistic | freepoor: no, female | freepoor: no, male | freepoor: yes, female | freepoor: yes, male |
---|---|---|---|---|---|
visits | Min / Max | 0 / 8.0 | 0 / 9.0 | 0 / 5.0 | 0 / 7.0 |
 | Med [IQR] | 0 [0;0] | 0 [0;0] | 0 [0;0] | 0 [0;0] |
 | Mean (std) | 0.4 (0.9) | 0.2 (0.7) | 0.2 (0.8) | 0.1 (0.6) |
 | N (NA) | 2618 (0) | 2350 (0) | 84 (0) | 138 (0) |
income | Min / Max | 0 / 1.5 | 0 / 1.5 | 0 / 1.1 | 0 / 1.1 |
 | Med [IQR] | 0.3 [0.2;0.7] | 0.7 [0.3;0.9] | 0.2 [0.1;0.3] | 0.2 [0.1;0.5] |
 | Mean (std) | 0.5 (0.3) | 0.7 (0.4) | 0.2 (0.2) | 0.3 (0.2) |
 | N (NA) | 2618 (0) | 2350 (0) | 84 (0) | 138 (0) |
`visits` and `income` over `gender` and `freepoor`

(The `as_flextable()` command from the crosstable package helped to render this elegant HTML table we see. It should be possible to produce Word/PDF output too, which we might see later.)
- Average visits for `female` patients seem to be higher.
- Average visits for `freepoor` patients are, from the table, actually lower (and their mean income is lower, of course).
- Income for `female` patients seems to be lower.
- Median `visits` are \(0\)!! Clearly, most people do not visit the doctor in a 2-week period.
- Clearly the people who are `freepoor` (on Govt Insurance) AND have a chronic condition are those who have lower average income and a higher average number of visits to the doctor…but there are relatively few of them (n = 55) in this dataset…
- Are there differences in the mean of Quant variables across levels of single or multiple Qual variables? This could be a first look at whether the population is fractured into sub-groups, and could be a point of research interest, e.g. disparity in interest in a product across groups.
- Does the `sd` differ significantly across groups? This could indicate that some groups are more heterogeneous than others, and may need to be studied further.
12 Summaries and Uncertainty
So, are we sure these summaries speak the truth? Are they accurate? Do they really represent a truth about the population from which this data sample was drawn?
We will need to deal with these ideas when we get to Inferential Statistics. For now, we will just note that the summaries we have obtained are sample statistics. We will need to perform some analysis to understand how well these sample statistics represent the population statistics.
13 More on dplyr
The dplyr package is capable of doing much more than just `count`, `group_by`, and `summarize`. We will encounter this package many times more as we build our intuition about data visualization. A full tutorial on dplyr is here:

- dplyr Tutorial
14 Your Turn
Which would be the `group_by` variables here? And what would you summarize? With which function?
```{r}
library(CardioDataSets)
data(package = "CardioDataSets") # Lists datasets in the package
```
```{r}
library(NeuroDataSets)
data(package = "NeuroDataSets") # Lists datasets in the package
```
```{r}
library(Lock5Data)
library(Lock5withR)
data(package = "Lock5Data") # Lists datasets in the package
data(package = "Lock5withR") # Lists datasets in the package
```
15 Wait, But Why?
- Data Summaries give you the essentials, without getting bogged down in the details (just yet).
- Summaries help you “live with your data”; this is an important step in understanding it, and deciding what to do with it.
- Summaries help evoke Questions and Hypotheses about the population, which may lead to inquiries, analysis, and insights.
- Grouped Summaries should tell you if:
  - counts of groups in your target audience are lopsided/imbalanced; Go and Get your data again.
  - there are visible differences in Quant data across groups, so your target audience could be nicely fractured;
  - etc.
16 Conclusion
- `mosaic::inspect()`, `skimr::skim()`, and `dplyr::glimpse()` give us an overall summary of our data.
- Using `dplyr::count()` we can get counts of levels of Qual variables, and of combinations of levels of multiple Qual variables.
- With `dplyr::summarise()` we can get summary statistics of Quant variables, singly or in pairs, or even all together.
- Using `dplyr::group_by()` we can group the data by levels of one or more Qual variables, and then use `dplyr::summarise()` to get summary statistics of Quant variables within each group.
- `crosstable::crosstable()` can also be used to get grouped summaries of multiple Quant variables over Qual variables, using the formula interface.

Make these part of your Workflow.
17 AI Generated Summary and Podcast
This is a tutorial on using the R programming language to perform descriptive statistical analysis on data sets. The tutorial focuses on summarizing data using various R packages like dplyr and crosstable. It emphasizes the importance of understanding the data’s structure, identifying different types of variables (qualitative and quantitative), and calculating summary statistics such as means, medians, and frequencies. The tutorial provides examples using real datasets and highlights the significance of data summaries in gaining initial insights, formulating research questions, and identifying potential issues with the data.
18 References
- Lock, Lock, Lock, Lock, and Lock. (2021). Statistics: Unlocking the Power of Data, 3rd Edition. Wiley. https://media.wiley.com/product_data/excerpt/69/11196821/1119682169-32.pdf
- Stigler, Stephen M. (2016). The Seven Pillars of Statistical Wisdom. Harvard University Press.
Package | Version | Citation |
---|---|---|
CardioDataSets | 0.2.0 | Caceres Rossi (2025a) |
crosstable | 0.8.2 | Chaltiel (2025) |
janitor | 2.2.1 | Firke (2024) |
Lock5Data | 3.0.0 | Lock (2021) |
Lock5withR | 1.2.2 | Pruim (2015) |
mosaic | 1.9.2 | Pruim, Kaplan, and Horton (2017) |
NeuroDataSets | 0.2.0 | Caceres Rossi (2025b) |
skimr | 2.2.1 | Waring et al. (2025) |
Citation
@online{v.2023,
author = {V., Arvind},
title = {{Summaries}},
date = {2023-10-15},
url = {https://madhatterguide.netlify.app/content/courses/Analytics/10-Descriptive/Modules/10-FavStats/},
langid = {en},
abstract = {Bill Gates walked into a bar, and everyone’s salary went
up on average.}
}