Groups

Outliers and other Crazy Things

Arvind V.

2024-06-24

“In keeping silent about evil, in burying it so deep within us that no sign of it appears on the surface, we are implanting it, and it will rise up a thousand fold in the future.”

— Aleksandr Solzhenitsyn

Setting up R Packages

library(tidyverse)
library(mosaic)
library(ggformula)
library(skimr)
library(janitor) # Data cleaning and tidying package
library(visdat) # Visualize whole dataframes for missing data
library(naniar) # Clean missing data
library(DT) # Interactive Tables for our data
library(tinytable) # Elegant Tables for our data
library(ggrepel) # Repel overlapping text labels in ggplot
library(marquee) # Annotations in ggplot

Plot Fonts and Theme

library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    # theme(panel.widths = unit(11, "cm"),
    #       panel.heights = unit(6.79, "cm")) + # Golden Ratio

    theme(
      plot.margin = margin_auto(t = 1, r = 2, b = 1, l = 1, unit = "cm"),
      plot.background = element_rect(
        fill = "bisque",
        colour = "black",
        linewidth = 1
      )
    ) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 10
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 8
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

What graphs will we see today?

Variable #1	Variable #2	Chart Names	Chart Shape
Quant	Qual	Box Plot

What kind of Data Variables will we choose?

No	Pronoun	Answer	Variable/Scale	Example	What Operations?
1	How Many / Much / Heavy? Few? Seldom? Often? When?	Quantities, with Scale and a Zero Value.Differences and Ratios /Products are meaningful.	Quantitative/Ratio	Length,Height,Temperature in Kelvin,Activity,Dose Amount,Reaction Rate,Flow Rate,Concentration,Pulse,Survival Rate	Correlation
4	What, Who, Where, Whom, Which	Name, Place, Animal, Thing	Qualitative/Nominal	Name	Count no. of cases,Mode

Inspiration

Alice said, “I say what I mean and I mean what I say!” Are the rest of us so sure? What do we mean when we use any of the phrases above? How definite are we? There is a range of “sureness” and “unsureness”…and this is where we can use box plots like Figure 1 to show that range of opinion.

Maybe it is time for a box plot on uh, shades¹ of meaning for ~~Jane Austen~~ Gen-Z phrases! Bah.

How do these Chart(s) Work?

Box Plots are an extremely useful data visualization that gives us an idea of the distribution of a Quant variable, for each level of another Qual variable.

How are Boxplots Computed?

The internal process of this plot is as follows:

(Hat tip to student Tanya Michelle Justin for a good question on outlier calculation)

Make groups of the Quant variable for each level of the Qual.
In each group, rank the Quant variable values in increasing order.
Calculate:
- The values for Q1, median = Q2, and Q3, based on rank!!
- Values for min, max, and then IQR = Q3 - Q1
- Calculate outlier limits:
  - \([Q1 - 1.5*IQR, Q3 + 1.5*IQR]\)
- Whiskers: extend from the box to the most extreme values that still lie within \([Q1 - 1.5*IQR, Q3 + 1.5*IQR]\)
- Outliers: All values outside of \([Q1 - 1.5*IQR, Q3 + 1.5*IQR]\)
Plot these as a vertical or horizontal box structure, as shown.

As a result of this, while the box-part of the boxplot always shows 2 full quartiles, the whiskers may not stretch through their quartiles, since some values may be outliers on either side.
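The computation described above can be sketched in base R on a small vector. This is only an illustration: the vector `x` is invented, and `quantile()` is used with its default method, which is an assumption about how the quartiles are obtained.

```r
# A small, invented sample with one suspiciously large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# Quartiles from the ordered values (quantile() default method, an assumption)
q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
q1 <- q[1]
q3 <- q[3]

iqr <- q3 - q1                  # IQR = Q3 - Q1
lower_limit <- q1 - 1.5 * iqr   # outlier limits
upper_limit <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme values that are still inside the limits
inside <- x[x >= lower_limit & x <= upper_limit]
whiskers <- range(inside)

# Everything outside the limits is drawn as an individual point
outliers <- x[x < lower_limit | x > upper_limit]

q
whiskers
outliers
```

Here the upper whisker stops at 9, well short of the maximum of 50, which is flagged as an outlier.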

Ranks and Values

The Quant variable is ordered based on the values from min to max. So you could imagine that each value has a rank or sequence number. The min value has \(rank = 1\) and the max value has \(rank = length(var)\).
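A minimal sketch of this idea in base R, using an invented vector `x`:

```r
x <- c(12, 7, 30, 19)   # invented values

sort(x)   # the values in increasing order
rank(x)   # the rank of each value, in its original position

rank(x)[which.min(x)]                # the smallest value has rank 1
rank(x)[which.max(x)] == length(x)   # the largest value has rank length(x)
```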

Histograms and Box Plots

Note how the histogram dwells upon the mean and standard deviation, whereas the boxplot focuses on the median and quartiles.

The former uses the values of the Quant variable, whereas the latter uses their sequence number or ranks.
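One way to see the difference is to change a single value without changing anyone's rank. The sketch below uses two invented vectors that differ only in their largest value:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)     # invented values
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 900)   # same ranks, but the largest value is far bigger

# Value-based summaries move a lot; rank-based summaries do not move at all
c(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x))
c(mean = mean(y), sd = sd(y), median = median(y), iqr = IQR(y))
```

The mean and standard deviation of `y` are pulled upwards by the single large value, while the median and IQR are identical for both vectors.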

Box plots are often used, for example, in HR operations to understand Salary distributions across grades of employees. Marks of students in competitive exams are also declared using Quartiles.

Box Plots and Skewness

Boxplots can be symmetric or show evidence of skew in the data. In the skewed case, the box will typically have two halves of different “sizes”, since the two middle Quartiles span different ranges in value.

In Figure 3 (a), we see the difference between boxplots that show symmetric and skewed distributions. The “lid” and the “bottom” of the box are not of similar width in distributions with significant skewness.

Compare these with the corresponding distributions in Figure 3 (b).
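The unequal halves can also be checked numerically. The sketch below is an illustration under stated assumptions: the two "samples" are evenly spaced quantiles of a normal and an exponential distribution (chosen here only because one is symmetric and the other is right-skewed), and `box_halves()` is a hypothetical helper, not part of any package.

```r
# Deterministic "samples": evenly spaced quantiles of two textbook distributions
p <- ppoints(101)
symmetric <- qnorm(p)   # bell-shaped, symmetric
skewed <- qexp(p)       # long right tail

# Hypothetical helper: widths of the two halves of the box
box_halves <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.50, 0.75), names = FALSE)
  c(lower_half = q[2] - q[1], upper_half = q[3] - q[2])
}

box_halves(symmetric)   # the two halves are (nearly) equal
box_halves(skewed)      # the upper half is clearly wider
```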

Box Plots and Outliers

Box plots can reveal outliers in a distribution, including cases where a large number of outliers pile up on one side, as in Figure 4.

Plotting Box Plots

We will first look at Wage data from the General Social Survey (1974-2018) conducted in the USA, which is used to illustrate wage discrepancies by gender (while also considering respondent occupation, age, and education). This is available on Vincent Arel-Bundock’s superb repository of datasets. Let us read into R directly from the website.

Case Study-1: `gss_wages` dataset

Read Data

wages <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/gss_wages.csv")

Rows: 61697 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): occrecode, wrkstat, gender, educcat, maritalcat
dbl (7): rownames, year, realrinc, age, occ10, prestg10, childs

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data has automatically been read into the webr session, so you can continue on to the next code chunk!

#| context: setup

ggplot2::theme_set(theme_classic())
# Read the data
wages <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/gss_wages.csv")

Inspect Data

As per our Workflow, we will look at the data using all the three methods we have seen.

glimpse(wages)

Rows: 61,697
Columns: 12
$ rownames   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ year       <dbl> 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974,…
$ realrinc   <dbl> 4935, 43178, NA, NA, 18505, 22206, 55515, NA, NA, 4935, NA,…
$ age        <dbl> 21, 41, 83, 69, 58, 30, 48, 67, 51, 54, 89, 71, 27, 30, 22,…
$ occ10      <dbl> 5620, 2040, NA, NA, 5820, 910, 230, 6355, 4720, 3940, 4810,…
$ occrecode  <chr> "Office and Administrative Support", "Professional", NA, NA…
$ prestg10   <dbl> 25, 66, NA, NA, 37, 45, 59, 49, 28, 38, 47, 45, 50, 29, 33,…
$ childs     <dbl> 0, 3, 2, 2, 0, 0, 2, 1, 2, 2, 3, 1, 4, 3, 0, 1, 2, 3, 4, 8,…
$ wrkstat    <chr> "School", "Full-Time", "Housekeeper", "Housekeeper", "Full-…
$ gender     <chr> "Male", "Male", "Female", "Female", "Female", "Male", "Male…
$ educcat    <chr> "High School", "Bachelor", "Less Than High School", "Less T…
$ maritalcat <chr> "Married", "Married", "Widowed", "Widowed", "Never Married"…

skim(wages)

Data summary
Name	wages
Number of rows	61697
Number of columns	12
_______________________
Column type frequency:
character	5
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
occrecode	3561	0.94	5	37	11
wrkstat	21	1.00	5	23	8
gender	0	1.00	4	6	2
educcat	135	1.00	8	21	5
maritalcat	27	1.00	7	13	5

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rownames	0	1.00	30849.00	17810.53	1	15425	30849	46273	61697.0	▇▇▇▇▇
year	0	1.00	1996.07	12.79	1974	1985	1996	2006	2018.0	▆▇▇▇▇
realrinc	23810	0.61	22326.36	28581.79	227	8156	16563	27171	480144.5	▇▁▁▁▁
age	219	1.00	46.18	17.56	18	32	44	59	89.0	▇▇▆▅▂
occ10	3561	0.94	4695.77	2627.72	10	2710	4720	6230	9997.0	▃▅▇▂▃
prestg10	4186	0.93	43.06	12.99	16	33	42	50	80.0	▃▇▇▃▁
childs	189	1.00	1.92	1.76	0	0	2	3	8.0	▇▇▂▁▁

inspect(wages)


categorical variables:  
        name     class levels     n missing
1  occrecode character     11 58136    3561
2    wrkstat character      8 61676      21
3     gender character      2 61697       0
4    educcat character      5 61562     135
5 maritalcat character      5 61670      27
                                   distribution
1 Professional (19%), Service (16.9%) ...      
2 Full-Time (49.4%), Housekeeper (15.1%) ...   
3 Female (56.1%), Male (43.9%)                 
4 High School (51.5%) ...                      
5 Married (51.7%), Never Married (21.8%) ...   

quantitative variables:  
      name   class  min    Q1 median    Q3      max         mean           sd
1 rownames numeric    1 15425  30849 46273  61697.0 30849.000000 17810.534116
2     year numeric 1974  1985   1996  2006   2018.0  1996.073715    12.794470
3 realrinc numeric  227  8156  16563 27171 480144.5 22326.359234 28581.794499
4      age numeric   18    32     44    59     89.0    46.176177    17.561065
5    occ10 numeric   10  2710   4720  6230   9997.0  4695.774081  2627.724076
6 prestg10 numeric   16    33     42    50     80.0    43.060701    12.987526
7   childs numeric    0     0      2     3      8.0     1.923457     1.763569
      n missing
1 61697       0
2 61697       0
3 37887   23810
4 61478     219
5 58136    3561
6 57511    4186
7 61508     189

#| label: glimpse-wages-webr
wages %>%
  glimpse()

#| label: skim-wages-webr
wages %>% 
  skim()

#| label: inspect-wages-webr
wages %>% 
  inspect()

Much data is missing in the target variable realrinc (income): 23,810 of the 61,697 rows.
There is a good mix of Qual and Quant variables.

Data Munging

Since so much data is missing in the target variable realrinc, and there is still enough data left over, we can drop the rows with missing values in that variable. This is not advised at all as a general procedure!! Data is valuable and there are better ways to manage this problem!
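To see what dropping rows costs, it helps to count before and after. The sketch below uses an invented toy data frame and base R equivalents of `tidyr::drop_na()`; it is an illustration only and does not touch the real `wages` data.

```r
# Invented toy data with gaps in two different columns
toy <- data.frame(
  realrinc = c(4935, NA, 18505, NA, 55515),
  educcat = c("High School", "Bachelor", NA, "Graduate", "Bachelor")
)

colSums(is.na(toy))   # missing values per column

# Keep rows where the target is present (like tidyr::drop_na(realrinc))
target_present <- toy[!is.na(toy$realrinc), ]

# Keep only fully complete rows (like tidyr::drop_na() with no column named)
all_present <- toy[complete.cases(toy), ]

nrow(target_present)   # rows left when only the target must be present
nrow(all_present)      # rows left when every column must be present
```

Naming the column keeps more rows than dropping on all columns, which is why the column is named explicitly when only the target variable matters.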

Important

It turns out that following our process of munging the data using the {naniar} package to replace common NA strings and numbers with actual NA makes the code run very slowly. I need to figure out how to optimize this.

wages_clean <- wages %>%
  tidyr::drop_na(realrinc) %>% # choose column or leave blank to choose all columns

  janitor::clean_names(case = "snake") %>% # clean names

  dplyr::mutate(across(where(is.character), as_factor)) %>% # make factors
  dplyr::mutate(childs = as_factor(childs)) %>% # make childs a factor too
  dplyr::relocate(where(is.factor), .after = rownames) # move factors to the right of rownames

glimpse(wages_clean)

Rows: 37,887
Columns: 12
$ rownames   <dbl> 1, 2, 5, 6, 7, 10, 15, 16, 21, 26, 27, 30, 31, 33, 34, 36, …
$ occrecode  <fct> "Office and Administrative Support", "Professional", "Offic…
$ childs     <fct> 0, 3, 0, 0, 2, 2, 0, 1, 0, 1, 4, 3, 2, 0, 4, 2, 4, 0, 1, 0,…
$ wrkstat    <fct> "School", "Full-Time", "Full-Time", "School", "Full-Time", …
$ gender     <fct> Male, Male, Female, Male, Male, Female, Female, Male, Male,…
$ educcat    <fct> High School, Bachelor, High School, Bachelor, Graduate, Les…
$ maritalcat <fct> Married, Married, Never Married, Married, Married, Married,…
$ year       <dbl> 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974, 1974,…
$ realrinc   <dbl> 4935, 43178, 18505, 22206, 55515, 4935, 4935, 18505, 11103,…
$ age        <dbl> 21, 41, 58, 30, 48, 54, 22, 23, 25, 59, 53, 35, 51, 21, 72,…
$ occ10      <dbl> 5620, 2040, 5820, 910, 230, 3940, 4020, 7810, 8640, 4710, 4…
$ prestg10   <dbl> 25, 66, 37, 45, 59, 38, 33, 28, 31, 48, 31, 47, 39, 51, 38,…

Data Dictionary

From the dataset documentation page, we note that this is a large dataset (61K rows), with 11 variables:

Qualitative Data

occrecode(fct): recode of the occupation code into one of 11 main categories
wrkstat(fct): the work status of the respondent (full-time, part-time, temporarily not working, unemployed (laid off), retired, school, housekeeper, other). 8 levels.
gender(fct): respondent’s gender (male or female). 2 levels.
educcat(fct): respondent’s degree level (Less Than High School, High School, Junior College, Bachelor, or Graduate). 5 levels.
maritalcat(fct): respondent’s marital status (Married, Widowed, Divorced, Separated, Never Married). 5 levels.
childs(fct): number of children (0-8), read in as a number and converted to a factor above. 9 levels.

Quantitative Data

year(dbl): the survey year
realrinc(dbl): the respondent’s base income (in constant 1986 USD)
age(dbl): the respondent’s age in years
occ10(dbl): respondent’s occupation code (2010)
prestg10(dbl): respondent’s occupational prestige score (2010)

Data Table

wages_clean %>%
  DT::datatable(
    caption = htmltools::tags$caption(
      style = "caption-side: top; text-align: left; color: black; font-size: 150%;",
      "GSS Wages Dataset (Clean)"
    ),
    options = list(pageLength = 10, autoWidth = TRUE)
  ) %>%
  DT::formatStyle(
    columns = names(wages_clean),
    fontFamily = "Roboto Condensed",
    fontSize = "12px"
  )

Table 1: GSS Wages Clean Dynamic Data Table

Experiment Description

The target variable for an experiment that resulted in this data might be the realrinc variable, the resultant income of the individual, which is a numerical variable.
The predictor variables could be gender, educcat, maritalcat, and wrkstat (all categorical variables), along with age and childs (both recorded as numbers).
Note that childs is coded as a numerical variable, but it is actually a count of children, and so it can be treated as a categorical variable with 9 levels (0 to 8). We have done that above.

Research Questions:

What is the basic distribution of realrinc?
Is realrinc affected by gender?
By educcat? By maritalcat?
Is realrinc affected by childs?
Do combinations of these factors have an effect on the target variable?

These should do for now! But we should make more questions when we have seen some plots!

The Monkey Grammarian’s Note

See that preposition “by” in the questions above? It is a good idea to use it when you are asking about the effect of a Qual variable on a Quant variable. And we know that Qual variables are the ones that define…groups! So we are asking about the effect of groups on a Quant variable.
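In code, "by" usually turns into a split-then-summarise step. A minimal base R sketch with an invented toy data frame (the column names mirror the dataset, the values do not):

```r
# Invented toy data: one Quant variable and one Qual variable
toy <- data.frame(
  gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
  realrinc = c(30000, 22000, 45000, 18000, 27000, 26000)
)

# "Income BY gender": split the Quant values into groups, then summarise each group
medians <- tapply(toy$realrinc, toy$gender, median)
medians
```

Each entry of `medians` corresponds to the middle line of one box in a grouped box plot.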

Plotting Box Plots

Question-1: What is the basic distribution of `realrinc`?

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  gf_boxplot(realrinc ~ "Overall", orientation = "x") %>% # Dummy X-axis "variable"
  gf_labs(
    title = "Plot 1A: Income has a skewed distribution",
    subtitle = "Many outliers on the high side",
    x = "", y = "Income"
  ) # Blank out the X-axis title

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  ggplot() +
  geom_boxplot(aes(y = realrinc, x = "Income")) + # Dummy X-axis "variable"
  labs(
    title = "Plot 1A: Income has a skewed distribution",
    subtitle = "Many outliers on the high side"
  )


wages %>% 
  janitor::clean_names(case = "snake") %>% 
  janitor::remove_empty(which = c("rows", "cols")) %>% 
  tidyr::drop_na(realrinc) %>% 
  gf_boxplot(realrinc ~ "Income",orientation = "x") %>% # Dummy X-axis "variable"
  gf_labs(title = "Plot 1A: Income has a skewed distribution",
          subtitle = "Many outliers on the high side")
##
wages %>% 
  janitor::clean_names(case = "snake") %>% 
  janitor::remove_empty(which = c("rows", "cols")) %>% 
  tidyr::drop_na(realrinc) %>%  
  ggplot() + 
  geom_boxplot(aes(y = realrinc, x = "Income")) +  # Dummy X-axis "variable"
  labs(title = "Plot 1A: Income has a skewed distribution",
          subtitle = "Many outliers on the high side")

Income is a very skewed distribution, as might be expected.
Presence of many higher-side outliers is noted.
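The one-sided outliers follow directly from the quartiles printed by skim() and inspect() earlier. The arithmetic below reuses those printed numbers; the plotting function may compute its quartiles slightly differently, so treat the limits as approximate.

```r
# Summary values of realrinc as printed by skim() / inspect() above
q1 <- 8156
q3 <- 27171
min_income <- 227
max_income <- 480144.5

iqr <- q3 - q1
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

c(iqr = iqr, lower_limit = lower_limit, upper_limit = upper_limit)

min_income < lower_limit   # FALSE: no income can fall below the lower limit
max_income > upper_limit   # TRUE: the top incomes lie far beyond the upper limit
```

The lower limit is negative, so on the raw scale no income can be a low-side outlier, whereas every income above roughly 55,700 is plotted as a high-side outlier.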

Question-2: Is `realrinc` affected by `gender`?

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  gf_boxplot(gender ~ realrinc, orientation = "y") %>%
  gf_labs(y = "Gender", x = "Income", title = "Plot 2A: Income by Gender")

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  gf_boxplot(gender ~ log10(realrinc), orientation = "y") %>%
  gf_labs(y = "Gender", x = "Income", title = "Plot 2B: Log(Income) by Gender")

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  gf_boxplot(gender ~ realrinc, fill = ~gender, orientation = "y") %>%
  gf_refine(scale_x_log10(), scale_fill_brewer(palette = "Set1")) %>%
  gf_labs(y = "Gender", x = "Income", title = "Plot 2C: Income filled by Gender, log scale")

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  ggplot() +
  geom_boxplot(aes(y = gender, x = realrinc)) +
  labs(title = "Plot 2A: Income by Gender")
##
wages_clean %>%
  ggplot() +
  geom_boxplot(aes(y = gender, x = log10(realrinc))) +
  labs(title = "Plot 2B: Log(Income) by Gender")
##
wages_clean %>%
  ggplot() +
  geom_boxplot(aes(y = gender, x = realrinc, fill = gender)) +
  scale_x_log10() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Plot 2C: Income filled by Gender, log scale")

wages %>% 
  janitor::clean_names(case = "snake") %>% 
  janitor::remove_empty(which = c("rows", "cols")) %>% 
  tidyr::drop_na(realrinc) %>% 
  gf_boxplot(gender ~ realrinc,orientation = "y") %>% 
  gf_labs(title = "Plot 2A: Income by Gender")
##
wages %>% 
  janitor::clean_names(case = "snake") %>% 
  janitor::remove_empty(which = c("rows", "cols")) %>% 
  tidyr::drop_na(realrinc) %>% 
  gf_boxplot(gender ~ log10(realrinc),orientation = "y") %>% 
  gf_labs(title = "Plot 2B: Log(Income) by Gender")
##
wages %>% 
  janitor::clean_names(case = "snake") %>% 
  janitor::remove_empty(which = c("rows", "cols")) %>% 
  tidyr::drop_na(realrinc) %>% 
  gf_boxplot(gender ~ realrinc, fill = ~ gender,orientation = "y") %>% 
  gf_refine(scale_x_log10(), scale_fill_brewer(palette = "Set1")) %>% 
  gf_labs(title = "Plot 2C: Income filled by Gender, log scale")

Even when split by gender, realrinc presents a skewed set of distributions.
The IQR for males is smaller than the IQR for females. There is less variation in the middle ranges of realrinc for men.
log10 transformation helps to view and understand the regions of low realrinc.
There are outliers on both sides, indicating that there may be many people who make very small amounts of money and large amounts of money in both genders.
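Why do low-side outliers appear only after the log transformation? The sketch below counts outliers on both scales for an invented, income-like vector; `count_outliers()` is a hypothetical helper written for this illustration.

```r
# Invented, income-like values: one very small and one very large
income <- c(200, 8000, 10000, 12000, 16000, 20000, 24000, 28000, 400000)

# Hypothetical helper: count points beyond the 1.5 * IQR limits on each side
count_outliers <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  c(low = sum(v < q[1] - fence), high = sum(v > q[2] + fence))
}

raw <- count_outliers(income)
logged <- count_outliers(log10(income))

raw      # raw scale: only the very large value is flagged
logged   # log scale: the very small value is flagged as well
```

On the raw scale the lower limit is negative, so the tiny income hides inside the whisker; on the log scale the same point sits far below the box.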

Question-3: Is `realrinc` affected by `educcat`?

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  gf_boxplot(educcat ~ realrinc, orientation = "y") %>%
  gf_labs(title = "Plot 3A: Income by Education Category")

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  gf_boxplot(educcat ~ log10(realrinc), orientation = "y") %>%
  gf_labs(title = "Plot 3B: Log(Income) by Education Category")

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  gf_boxplot(
    reorder(educcat, realrinc, FUN = median) ~ log(realrinc),
    fill = ~educcat, alpha = 0.5, orientation = "y"
  ) %>%
  gf_labs(
    title = "Plot 3C: Log(Income) by Education Category, sorted",
    x = "Log Income",
    y = "Education Category"
  ) %>%
  gf_refine(scale_fill_brewer(palette = "Set1"))

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  gf_boxplot(reorder(educcat, realrinc, FUN = median) ~ realrinc,
    fill = ~educcat, orientation = "y",
    alpha = 0.5
  ) %>%
  gf_refine(scale_x_log10()) %>%
  gf_labs(
    title = "Plot 3D: Income by Education Category, sorted",
    subtitle = "Log Income",
    x = "Income",
    y = "Education Category"
  ) %>%
  gf_refine(scale_fill_brewer(palette = "Set1"))

Note that educcat has been sorted in the last two plots by the median value of realrinc in each category, which makes the trend easier to see. Note also that educcat has an “NA” group, for people who did not report their education level. By default this is plotted at one end of the box plot, regardless of how the boxplots are sorted. (Try reordering in desc() order to see what happens!)
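What reorder() does to the factor can be seen on a tiny invented example. Negating the Quant variable is used here as one simple way to reverse the order; it is an illustration, not the only option.

```r
# Invented toy data: three education groups, two incomes each
educ <- factor(c("HS", "HS", "Grad", "Grad", "Bach", "Bach"))
income <- c(10, 12, 40, 50, 25, 30)

levels(educ)                                   # default: alphabetical
levels(reorder(educ, income, FUN = median))    # increasing median income
levels(reorder(educ, -income, FUN = median))   # decreasing median income
```

Only the order of the levels changes, not the data, and the plotting functions draw the boxes in level order.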

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  ggplot() +
  geom_boxplot(aes(realrinc, educcat)) + # (x,y) format
  labs(title = "Plot 3A: Income by Education Category")
##
wages_clean %>%
  ggplot() +
  geom_boxplot(aes(log10(realrinc), educcat)) +
  labs(title = "Plot 3B: Log(Income) by Education Category")
##
wages_clean %>%
  ggplot() +
  geom_boxplot(
    aes(log(realrinc),
      reorder(educcat, realrinc, FUN = median),
      fill = educcat
    ),
    alpha = 0.5
  ) +
  labs(
    title = "Plot 3C: Log(Income) by Education Category, sorted",
    x = "Log Income", y = "Education Category"
  )
##
wages_clean %>%
  ggplot() +
  geom_boxplot(
    aes(realrinc,
      reorder(educcat, realrinc, FUN = median),
      fill = educcat
    ),
    alpha = 0.5
  ) +
  scale_x_log10() +
  labs(
    title = "Plot 3D: Income by Education Category, sorted",
    subtitle = "Log Income Scale",
    x = "Income", y = "Education Category"
  )


wages %>% 
  janitor::clean_names(case = "snake") %>% 
  janitor::remove_empty(which = c("rows", "cols")) %>% 
  tidyr::drop_na(realrinc) %>% 
  gf_boxplot(educcat ~ realrinc,orientation = "y") %>% 
  gf_labs(title = "Plot 3A: Income by Education Category")
##
wages %>% 
  janitor::clean_names(case = "snake") %>% 
  janitor::remove_empty(which = c("rows", "cols")) %>% 
  tidyr::drop_na(realrinc) %>% 
  gf_boxplot(educcat ~ log10(realrinc),orientation = "y") %>% 
  gf_labs(title = "Plot 3B: Log(Income) by Education Category")
##
wages %>% 
  janitor::clean_names(case = "snake") %>% 
  janitor::remove_empty(which = c("rows", "cols")) %>% 
  tidyr::drop_na(realrinc) %>% 
  gf_boxplot(reorder(educcat, realrinc, FUN = median) ~ log(realrinc), 
             fill = ~ educcat, orientation = "y",
             alpha = 0.5) %>% 
  gf_labs(title = "Plot 3C: Log(Income) by Education Category, sorted",
          x = "Log Income", y = "Education Category")
##
wages %>% 
  janitor::clean_names(case = "snake") %>% 
  janitor::remove_empty(which = c("rows", "cols")) %>% 
  tidyr::drop_na(realrinc) %>% 
  gf_boxplot(reorder(educcat, realrinc, FUN = median) ~ realrinc, 
             fill = ~ educcat,orientation = "y",
             alpha = 0.5) %>% 
  gf_refine(scale_x_log10()) %>% 
  gf_labs(title = "Plot 3D: Income by Education Category, sorted",
          subtitle = "Log Income Scale",
          x = "Income", y = "Education Category")

realrinc rises with educcat, which is to be expected.
However, there are people with very low and very high income in all categories of educcat
Hence educcat alone may not be a good predictor for realrinc.

We can do similar work with the other Qual variables. Let us now see how we can use more than one Qual variable to answer the last research question, Question 4.

Question-4: Is the target variable `realrinc` affected by combinations of Qual factors `gender`, `educcat`, `maritalcat` and `childs`?

Important

This is a rather complex question and could take us deep into Modelling. If we were to do this manually, we ideally ought to:

take each Qual variable and explain its effect on the target variable
remove that effect and model the remainder (i.e. the residual) with the next Qual variable
proceed in this way until we have a good model.

There are more modern Modelling Workflows, that can do things much faster and without such manual tweaking.

So we will simply plot box plots showing effects on the target variable of combinations of Qual variables taken two at a time. (We will of course use faceted box plots!)

We will also drop NA values all around this time, to avoid seeing boxplots for undocumented categories.
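Numerically, a faceted box plot summarises the Quant variable within every combination of two Qual variables. A minimal base R sketch with invented values (the column names mirror the dataset, the numbers do not):

```r
# Invented toy data: two Qual variables and one Quant variable
toy <- data.frame(
  gender = c("Male", "Male", "Female", "Female", "Male", "Female", "Male", "Female"),
  childs = c("0", "2", "0", "2", "0", "0", "2", "2"),
  realrinc = c(30000, 40000, 28000, 20000, 34000, 24000, 44000, 16000)
)

# Median income for every gender x childs combination:
# one cell here corresponds to one box in a faceted box plot
cell_medians <- with(
  toy,
  tapply(realrinc, list(gender = gender, childs = childs), median)
)
cell_medians
```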

Question-4: Is `realrinc` affected by combinations of factors?

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  tidyr::drop_na() %>% # drop rows with any NA, to avoid boxes for undocumented categories
  gf_boxplot(reorder(educcat, realrinc) ~ log10(realrinc),
    fill = ~educcat, orientation = "y",
    alpha = 0.5
  ) %>%
  gf_facet_wrap(vars(childs)) %>% # one panel per number of children
  gf_refine(scale_fill_brewer(type = "qual", palette = "Dark2")) %>%
  gf_labs(
    title = "Plot 4A: Log Income by Education Category and Family Size",
    x = "Log income",
    y = "Education Category"
  )

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  tidyr::drop_na() %>% # drop rows with any NA, to avoid boxes for undocumented categories
  mutate(childs = as_factor(childs)) %>%
  gf_boxplot(childs ~ log10(realrinc),
    group = ~childs,
    fill = ~childs, orientation = "y",
    alpha = 0.5
  ) %>%
  gf_facet_wrap(~gender) %>%
  gf_refine(scale_fill_brewer(type = "qual", palette = "Set3")) %>%
  gf_labs(
    title = "Plot 4B: Log Income by Gender and Family Size",
    x = "Log income",
    y = "No. of Children"
  )

ggplot2::theme_set(new = theme_custom())

wages_clean %>%
  tidyr::drop_na() %>% # drop rows with any NA, to avoid boxes for undocumented categories
  ggplot() +
  geom_boxplot(
    aes(log10(realrinc), reorder(educcat, realrinc),
      fill = educcat
    ), # aes() closes here
    alpha = 0.5
  ) +
  facet_wrap(vars(childs)) + # one panel per number of children
  scale_fill_brewer(type = "qual", palette = "Dark2") +
  labs(
    title = "Plot 4A: Log Income by Education Category and Family Size",
    x = "Log income", y = "Education Category"
  )
##
wages_clean %>%
  tidyr::drop_na() %>% # drop rows with any NA, to avoid boxes for undocumented categories
  mutate(childs = as_factor(childs)) %>%
  ggplot() +
  geom_boxplot(
    aes(log10(realrinc), childs,
      group = childs,
      fill = childs
    ), # aes() closes here
    alpha = 0.5
  ) +
  facet_wrap(vars(gender)) +
  scale_fill_brewer(type = "qual", palette = "Set3") +
  labs(
    title = "Plot 4B: Log Income by Gender and Family Size",
    x = "Log income",
    y = "No. of Children"
  )

#| label: fig-income-by-educcat-childs-webr

wages %>% 
  janitor::clean_names(case = "snake") %>% 
  janitor::remove_empty(which = c("rows", "cols")) %>% 
  drop_na() %>% 
  gf_boxplot(reorder(educcat, realrinc) ~ log10(realrinc),
             fill = ~ educcat, orientation = "y",
             alpha = 0.5) %>% 
  gf_facet_wrap(vars(childs)) %>% 
  gf_refine(scale_fill_brewer(type = "qual", palette = "Dark2")) %>% 
  gf_labs(title = "Plot 4A: Log Income by Education Category and Family Size", x = "Log income", y = "Education Category")
##
wages %>% 
  janitor::clean_names(case = "snake") %>% 
  janitor::remove_empty(which = c("rows", "cols")) %>% 
  drop_na() %>% 
  dplyr::mutate(childs = as_factor(childs)) %>% 
  gf_boxplot(childs ~ log10(realrinc),orientation = "y",
             group = ~ childs,
             fill = ~ childs, 
             alpha = 0.5) %>% 
  gf_facet_wrap(~ gender) %>% 
  gf_refine(scale_fill_brewer(type = "qual", palette = "Set3")) %>% 
  gf_labs(title = "Plot 4B: Log Income by Gender and Family Size",
          x = "Log income",
          y = "No. of Children")

We see that realrinc increases with educcat, across (almost) all family sizes childs.
However, this trend breaks a little when the family size childs is large, say >= 7. Be aware that the observations for such large families may be sparse, so this inference may not necessarily be valid.
We see that the effect of childs on realrinc is different for each gender! For females, the income steadily drops with the number of children, whereas for males it actually increases up to a certain family size before decreasing again.

Are the Differences Significant?

Hunches and Hypotheses

In data analysis, we always want to know¹, as in life, how important things are, whether they matter. To do this, we make up hunches, or more precisely, Hypotheses. We make two in fact:

\(H_0\): Nothing is happening;
\(H_a\): (“a” for Alternate): Something is happening and it is important enough to pay attention to.

We then pretend that \(H_0\) is true and ask that our data prove us wrong; if it does, we reject \(H_0\) in favour of \(H_a\).

This is a very important idea in Hypothesis Testing, one that helps you justify your hunch. We will study this when we do Stats Tests for differences between two means (t-tests) and between more than two means (ANOVA).
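One way to make "pretend that \(H_0\) is true" concrete is to shuffle the group labels and see how often the shuffled data look as extreme as the real data. The sketch below is a small permutation test on invented numbers; it is an illustration of the idea, not the t-test or ANOVA mentioned above.

```r
# Invented measurements for two groups
group_a <- c(10, 12, 11, 13, 12)
group_b <- c(20, 22, 21, 23, 19)
pooled <- c(group_a, group_b)

observed <- mean(group_b) - mean(group_a)

# Under H0 the group labels are arbitrary, so try every way of
# picking 5 of the 10 values to play the role of "group a"
splits <- combn(length(pooled), length(group_a))
diffs <- apply(splits, 2, function(idx) mean(pooled[-idx]) - mean(pooled[idx]))

# Share of relabellings at least as extreme as what we observed
# (the tiny tolerance guards against floating-point round-off)
p_value <- mean(abs(diffs) >= abs(observed) - 1e-12)

observed
p_value
```

Only 2 of the 252 possible relabellings are as extreme as the observed split, so the data argue strongly against \(H_0\) in this toy case.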

Wait, But Why?

Box plots are a powerful statistical graphic that give us a combined view of data ranges, quartiles, medians, and outliers.
Box plots can compare groups within our Quant variable, based on levels of a Qual variable. This is a very common and important task in research!
In your design research, you would have numerical Quant data that is accompanied by categorical Qual data pertaining to groups within your target audience.
Analyzing for differences in the Quant across levels of the Qual (e.g., household expenditure across groups of people) is a vital step in justifying time, effort, and money for further actions in your project. Don’t faff this.
Box plots are ideal for visualizing statistical tests for difference in mean values across groups (t-test and ANOVA). (Even though they plot medians)

Conclusion

Box Plots “dwell upon” medians and Quartiles
Box Plots can show distributions of a Quant variable over levels of a Qual variable
Box Plots should be sorted by median values of the Quant variable. And coloured too.
Horizontal aligning of multiple Box Plots is preferable! Why?
This allows a comparison of box plots side by side to visibly detect differences in medians and IQRs across such levels.

Your Turn

Here are a couple of datasets that you might want to analyze with box plots:

Insurance Data

Political Donations

UFO Encounters

The data dictionary for this dataset is here at the TidyTuesday Website. The TidyTuesday Website is a treasure trove of interesting datasets!

GPT-based Language detectors are biased against non-native English writers.

What story can you tell, and what deductions can you make, from Figure 5 below? How would you replicate it? What would you add?

AI Generated Summary and Podcast

This excerpt from “Groups – Applied Metaphors: Learning TRIZ, Complexity, Data/Stats/ML using Metaphors” provides a comprehensive guide to understanding and utilizing box plots for data visualization and analysis. The text explores the purpose, functionality, and application of box plots within the context of exploring relationships between quantitative and qualitative variables. The author illustrates these concepts using a case study of the “gss_wages” dataset, examining wage discrepancies by gender, occupation, age, and education. Through this analysis, the author highlights the effectiveness of box plots in visualizing distributions, identifying outliers, and comparing groups, providing valuable insights into the complexities of data. The text concludes with a call to action, encouraging readers to explore real-world datasets and apply these techniques to uncover hidden trends and patterns within data.

What are the relationships between qualitative and quantitative variables in the gss_wages dataset?
How do box plots help visualize and understand the distribution of income across different groups?
What insights can be gained by analyzing the impact of multiple qualitative factors on income distribution?

References

Winston Chang (2024). R Graphics Cookbook. https://r-graphics.org
Bevans, R. (2023, June 22). An Introduction to t Tests | Definitions, Formula and Examples. Scribbr. https://www.scribbr.com/statistics/t-test/
Brown, Angus (2008). The Strange Origins of the t-test. Physiology News, No. 71, Summer 2008. https://static.physoc.org/app/uploads/2019/03/22194755/71-a.pdf
Stephen T. Ziliak (2008). Guinnessometrics: The Economic Foundation of “Student’s” t. Journal of Economic Perspectives, Volume 22, Number 4, Fall 2008, Pages 199–216. https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.22.4.199
https://quillette.com/2024/08/03/xy-athletes-in-womens-olympic-boxing-paris-2024-controversy-explained-khelif-yu-ting/
Senefeld JW, Lambelet Coleman D, Johnson PW, Carter RE, Clayburn AJ, Joyner MJ. Divergence in Timing and Magnitude of Testosterone Levels Between Male and Female Youths. JAMA. 2020;324(1):99–101. doi:10.1001/jama.2020.5655. https://jamanetwork.com/journals/jama/fullarticle/2767852
Doriane Lambelet Coleman (2017). Sex in Sport, 80 Law and Contemporary Problems. Available at: https://scholarship.law.duke.edu/lcp/vol80/iss4/5
Distributome - An Interactive Web-based Resource for Probability Distributions https://distributome.org

R Package Citations

Package	Version	Citation
naniar	1.1.0	Tierney and Cook (2023)
sn	2.1.1	Azzalini (2023)
TeachHist	0.2.1	Lange (2023)
TeachingDemos	2.13	Snow (2024)
tinytable	0.13.0	Arel-Bundock (2025)
visdat	0.6.0	Tierney (2017)
visualize	4.5.0	Balamuta (2023)

Arel-Bundock, Vincent. 2025. tinytable: Simple and Configurable Tables in “HTML,” “LaTeX,” “Markdown,” “Word,” “PNG,” “PDF,” and “Typst” Formats. https://doi.org/10.32614/CRAN.package.tinytable.

Azzalini, A. 2023. The R Package sn: The Skew-Normal and Related Distributions Such as the Skew-t and the SUN (Version 2.1.1). Università degli Studi di Padova, Italia. https://cran.r-project.org/package=sn.

Balamuta, James. 2023. visualize: Graph Probability Distributions with User Supplied Parameters and Statistics. https://doi.org/10.32614/CRAN.package.visualize.

Lange, Carsten. 2023. TeachHist: A Collection of Amended Histograms Designed for Teaching Statistics. https://doi.org/10.32614/CRAN.package.TeachHist.

Snow, Greg. 2024. TeachingDemos: Demonstrations for Teaching and Learning. https://doi.org/10.32614/CRAN.package.TeachingDemos.

Tierney, Nicholas. 2017. “visdat: Visualising Whole Data Frames.” JOSS 2 (16): 355. https://doi.org/10.21105/joss.00355.

Tierney, Nicholas, and Dianne Cook. 2023. “Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations.” Journal of Statistical Software 105 (7): 1–31. https://doi.org/10.18637/jss.v105.i07.
