Facing the Abyss

EDA

Workflow

Descriptive

Author

Arvind V.

Published

October 21, 2023

Modified

September 30, 2025

Abstract

A complete EDA Workflow

1 An EDA / Statistical Analysis Process

So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what?

The best charts come from understanding your data, asking good questions of it, and displaying the answers to those questions as clearly as possible. One then uses statistical procedures, namely models and tests, to answer those questions quantitatively.

2 Set up your Project

Create a new Project in RStudio. File -> New Project -> Quarto Blog
Create a new Quarto document: all your Quarto documents should be in the posts/ folder. See the samples therein to get an idea.
Save the document with a meaningful name, e.g. EDA-Workflow-1.qmd
Create a new folder in the Project for your data files, e.g. data/. This can be inside the posts/ folder.
Store all datasets within this folder, and refer to them with relative paths, e.g. ../data/mydata.csv in any other Quarto document in the Project. (../ means “go up one level from the current folder”.)

Now edit the *.qmd file for this report to include the following sections, YAML, code chunks, and text as needed.

Download this here document as a Work Template

Hit the </>Code button at upper right to copy/save this very document as a Quarto Markdown template for your work. Delete the text that you don’t need, but keep most of the Sections as they are!

3 Setting up R Packages

Install packages using install.packages() in your Console.
Load up your libraries in a setup chunk.
Set df-print: paged so that (long) data frames print as nice paged tables and do not overrun your HTML page.
Add knitr options to your YAML header, so that all your plots are rendered in high quality PNG format.

title: "My Document"
format: html
df-print: paged
knitr:
  opts_chunk:
    dev: "ragg_png"

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(mosaic)

Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2

The 'mosaic' package masks several functions from core packages in order to add 
additional features.  The original behavior of these functions should not be affected by this.

Attaching package: 'mosaic'

The following object is masked from 'package:Matrix':

    mean

The following objects are masked from 'package:dplyr':

    count, do, tally

The following object is masked from 'package:purrr':

    cross

The following object is masked from 'package:ggplot2':

    stat

The following objects are masked from 'package:stats':

    binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
    quantile, sd, t.test, var

The following objects are masked from 'package:base':

    max, mean, min, prod, range, sample, sum

library(ggformula)
library(ggridges)
library(skimr)


Attaching package: 'skimr'

The following object is masked from 'package:mosaic':

    n_missing

library(janitor)


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

library(GGally)
library(corrplot)

corrplot 0.95 loaded

library(corrgram)


Attaching package: 'corrgram'

The following object is masked from 'package:GGally':

    baseball

The following object is masked from 'package:lattice':

    panel.fill

library(crosstable) # Summary stats tables


Attaching package: 'crosstable'

The following object is masked from 'package:purrr':

    compact

library(kableExtra)


Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

library(tinytable) # Static tables


Attaching package: 'tinytable'

The following object is masked from 'package:ggplot2':

    theme_void

library(DT) # Interactive tables
library(paletteer) # Colour Palettes for Peasants
##
## Add other packages here as needed, e.g.:
##
## scales/ggprism; # For scales / axes formatting
## ggstats/correlation; For Correlation Analysis
## vcd/vcdExtra/ggalluvial/ggpubr for Qual Data
## sf/tmap/osmplotr/rnaturalearth;
## igraph/tidygraph/ggraph/graphlayouts;
## harrypotter/wesanderson/tayloRswift;timburton;

Messages on Loading Packages

If you would rather not have all the messages and warnings from loading packages, you can set message=FALSE and warning=FALSE in the setup chunk above. But it is a good idea to read these messages at least once, since they often contain useful information about package versions, and any conflicts with other packages.
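A quiet setup chunk might look roughly like the sketch below. The `#|` lines are Quarto chunk-option comments and belong at the very top of the chunk. The `suppressPackageStartupMessages()` call is a base R alternative that is an addition here, not part of the workflow described above; it silences the startup messages of a single `library()` call.

```r
#| label: setup
#| message: false
#| warning: false

# With the chunk options above, Quarto hides the messages and warnings
# produced by everything in this chunk.
library(dplyr)

# Alternative (an addition, not from the original workflow):
# silence the startup messages of one package only.
suppressPackageStartupMessages(library(tidyr))
```

Whichever route you take, run the chunk once with messages visible so that you know which functions are being masked.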

3.1 Use Namespace based Code

Warning

Did you notice that different packages contain identically named functions? E.g. chisq.test and fisher.test in the janitor package mask the identically named functions from {stats}. The most recently loaded package does the masking, but all functions from all loaded packages remain available!! Just remember always to prefix each function with the package from whence it came! So write dplyr::filter() / dplyr::summarize() and not just filter() or summarize(), since a function of the same name may exist in another package that you happened to load later. And you will avoid a lot of coding grief.

(One can also use the conflicted package to set this up, but this is simpler for beginners like us. )
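To see why the package prefix matters, here is a small sketch using the dplyr / stats conflict reported in the loading messages. The vector `x` and the data frame `df` are made up for illustration.

```r
library(dplyr) # attaching dplyr masks stats::filter() and stats::lag()

x <- c(1, 2, 3, 4, 5)

# stats::filter() is a linear filter for series: here a 3-point moving average
moving_avg <- stats::filter(x, filter = rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
df <- data.frame(x = x)
large_rows <- dplyr::filter(df, x > 3)

moving_avg
large_rows
```

Both functions are called `filter()`, yet they do entirely different jobs. With the prefix, the code does the same thing regardless of the order in which the packages were loaded.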

Themes and Fonts

Set up a theme for your plots. This is a good time to set up your own theme, or use an existing one, e.g. ggprism, ggthemes, ggpubr, etc. If your Company has a logo or a house style, you can build its fonts and colours into your theme too.

# Chunk options
knitr::opts_chunk$set(
  fig.width = 7,
  fig.asp = 0.618, # Golden Ratio
  # out.width = "80%",
  fig.align = "center"
)
### Ggplot Theme
### https://rpubs.com/mclaire19/ggplot2-custom-themes
### https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto

# We have locally downloaded the `Alegreya` and `Roboto Condensed` fonts.
# They are located in a folder labelled `fonts` at the project root level.
# This ensures we are GDPR-compliant, and not using Google Fonts directly.
# Let us import these local fonts into our session and use them to define our ggplot theme.
library(systemfonts)
library(showtext)
library(ggrepel)
library(marquee)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 8
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

4 Read Data

Use readr::read_csv(); or data(...) if the data is in a package.
Do not use read.csv()!!

data(penguins, package = "datasets")

penguins <- penguins %>% janitor::clean_names(case = "snake") # Part of the process

penguins
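For data that lives in a CSV file rather than in a package, a minimal sketch with readr::read_csv() could look like this. The file and its contents are hypothetical: the sketch writes a tiny CSV to a temporary folder so that it runs anywhere, whereas in your Project the path would be something like ../data/mydata.csv.

```r
library(readr)

# Hypothetical file, created here only so that the sketch is self-contained
csv_path <- file.path(tempdir(), "mydata.csv")
readr::write_csv(
  data.frame(id = 1:3, group = c("a", "b", "a")),
  csv_path
)

# read_csv() returns a tibble; show_col_types = FALSE hides the column-spec message
mydata <- readr::read_csv(csv_path, show_col_types = FALSE)
mydata
```

Leave out `show_col_types = FALSE` the first time you read a new file, so that you can check the column types that readr has guessed.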

5 Examine Data

Use dplyr::glimpse()
Use skimr::skim() OR mosaic::inspect()
Use dplyr::summarise()
Use crosstable::crosstable()
Highlight any interesting summary stats or data imbalances, especially across groups

dplyr::glimpse(penguins)

Rows: 344
Columns: 8
$ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
$ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
$ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

skimr::skim(penguins)

Data summary
Name	penguins
Number of rows	344
Number of columns	8
_______________________
Column type frequency:
factor	3
numeric	5
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
species	0	1.00	FALSE	3	Ade: 152, Gen: 124, Chi: 68
island	0	1.00	FALSE	3	Bis: 168, Dre: 124, Tor: 52
sex	11	0.97	FALSE	2	mal: 168, fem: 165

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_len	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
bill_dep	2	0.99	17.15	1.97	13.1	15.60	17.30	18.7	21.5	▅▅▇▇▂
flipper_len	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂
body_mass	2	0.99	4201.75	801.95	2700.0	3550.00	4050.00	4750.0	6300.0	▃▇▆▃▂
year	0	1.00	2008.03	0.82	2007.0	2007.00	2008.00	2009.0	2009.0	▇▁▇▁▇

penguins %>%
  crosstable::crosstable(body_mass + bill_len + bill_dep ~ species) %>%
  as_flextable() %>%
  flextable::theme_vader() %>% # Darth Vadhyaar theme
  flextable::fontsize(size = 10, part = "all") %>%
  flextable::autofit()

label	variable	species
label	variable	Adelie	Chinstrap	Gentoo
body_mass	Min / Max	2850.0 / 4775.0	2700.0 / 4800.0	3950.0 / 6300.0
	Med [IQR]	3700.0 [3350.0;4000.0]	3700.0 [3487.5;3950.0]	5000.0 [4700.0;5500.0]
	Mean (std)	3700.7 (458.6)	3733.1 (384.3)	5076.0 (504.1)
	N (NA)	151 (1)	68 (0)	123 (1)
bill_len	Min / Max	32.1 / 46.0	40.9 / 58.0	40.9 / 59.6
	Med [IQR]	38.8 [36.8;40.8]	49.5 [46.3;51.1]	47.3 [45.3;49.5]
	Mean (std)	38.8 (2.7)	48.8 (3.3)	47.5 (3.1)
	N (NA)	151 (1)	68 (0)	123 (1)
bill_dep	Min / Max	15.5 / 21.5	16.4 / 20.8	13.1 / 17.3
	Med [IQR]	18.4 [17.5;19.0]	18.4 [17.5;19.4]	15.0 [14.2;15.7]
	Mean (std)	18.3 (1.2)	18.4 (1.1)	15.0 (1.0)
	N (NA)	151 (1)	68 (0)	123 (1)

Insights from Data Examination of penguins

body_mass: Gentoo penguins look to be the heaviest, followed by Chinstrap and then Adelie
bill_len: Adelie penguins have the shortest bills.
bill_dep: Adelie and Chinstrap bills are almost equally deep (mean 18.3 and 18.4), while Gentoo bills are noticeably shallower (mean 15.0)

Etc.

6 Data Dictionary and Experiment Description

Data Dictionary: A table containing the variable names, their interpretation, and their nature (Qual/Quant/Ord…)
If there are wrongly coded variables in the original data, state them in their correct form, so you can munge them in the next step
Declare what might be target and predictor variables, based on available information of the experiment, or a description of the data.
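A data dictionary can be started in code and then completed by hand. The sketch below uses base R only. The data frame `peng` holds the first nine rows of four penguins columns, copied from the glimpse() output shown earlier. The `interpretation` column is written by hand, and the Qual/Quant rule (numeric means Quant) is a simplifying assumption that you should check variable by variable.

```r
# First nine rows of four penguins columns, copied from the glimpse() output
peng <- data.frame(
  bill_len    = c(39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1),
  flipper_len = c(181L, 186L, 195L, NA, 193L, 190L, 181L, 195L, 193L),
  body_mass   = c(3750L, 3800L, 3250L, NA, 3450L, 3650L, 3625L, 4675L, 3475L),
  sex         = factor(c("male", "female", "female", NA, "female",
                         "male", "female", "male", NA))
)

data_dictionary <- data.frame(
  variable = names(peng),
  # Hand-written descriptions; adapt these to your own data
  interpretation = c("Bill length", "Flipper length", "Body mass", "Sex of the bird"),
  class = vapply(peng, function(col) class(col)[1], character(1)),
  # Simplifying assumption: numeric columns are Quant, all others Qual
  nature = ifelse(vapply(peng, is.numeric, logical(1)), "Quant", "Qual"),
  n_missing = vapply(peng, function(col) sum(is.na(col)), integer(1)),
  row.names = NULL
)

data_dictionary
```

An integer column such as year would be labelled Quant by this rule even though it may behave like a factor, which is exactly the kind of wrongly coded variable to note down for the munging step.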

Qualitative Variables

Categorical variables, e.g. species, island, sex
Use dplyr::count() to get counts of each category

Quantitative Variables

Continuous variables, e.g. body_mass, flipper_len, bill_len
Use dplyr::summarise() to get summary statistics of each variable
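The two commands can be sketched as follows. To keep the example self-contained, `peng` again holds the first nine rows of the penguins columns as printed by glimpse() earlier; with the full data you would use `penguins` directly.

```r
library(dplyr)

# First nine rows of three penguins columns, copied from the glimpse() output
peng <- data.frame(
  flipper_len = c(181L, 186L, 195L, NA, 193L, 190L, 181L, 195L, 193L),
  body_mass   = c(3750L, 3800L, 3250L, NA, 3450L, 3650L, 3625L, 4675L, 3475L),
  sex         = factor(c("male", "female", "female", NA, "female",
                         "male", "female", "male", NA))
)

# Qualitative variable: number of rows in each category (NA gets its own row)
sex_counts <- dplyr::count(peng, sex)

# Quantitative variable: summary statistics, here computed within each group
mass_by_sex <- peng %>%
  dplyr::filter(!is.na(sex)) %>%
  dplyr::group_by(sex) %>%
  dplyr::summarise(
    n = dplyr::n(),
    mean_body_mass = mean(body_mass, na.rm = TRUE),
    sd_body_mass = stats::sd(body_mass, na.rm = TRUE)
  )

sex_counts
mass_by_sex
```

Counting first is a cheap way to spot imbalanced groups before you compare their summary statistics.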

7 Data Munging

Drop NA data (since that is all we can do now. Sigh.)
Convert variables to factors as needed
Reformat / Rename other variables as needed
Clean badly formatted columns (e.g. text + numbers) using the tidyr::separate_wider_*() and tidyr::separate_longer_*() functions
Save the data as a modified file
Do not mess up the original data file
One Final Cleaned Data Table:
- Format your tables with tinytable::tt() OR knitr::kable()
- If you want an interactive table, go with DT::datatable()

```{r}
#| label: data-munging

# `raw_data`, `target_variable` and `rownames` are placeholders:
# substitute your own data frame and column names.
dataset_modified <- raw_data %>%
  janitor::clean_names(case = "snake") %>% # clean names

  tidyr::drop_na(target_variable) %>% # drop rows with NA in the target variable
  # (call tidyr::drop_na() with no arguments to drop rows with NA in any column)

  dplyr::mutate(dplyr::across(where(is.character), as.factor)) %>%
  # Convert character variables to factors
  # Check if other integer variables are actually factors

  dplyr::relocate(where(is.factor), .after = rownames) # move factors to the right of rownames
# And so on

dataset_modified %>% tinytable::tt()

```
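Cleaning a column that mixes text and numbers might go as in the sketch below. The data frame `raw` and its `weight` column are hypothetical, and the sketch assumes tidyr 1.3.0 or later, where `separate_wider_delim()` is available.

```r
library(tidyr)

# Hypothetical badly formatted column: a number and a unit in one text field
raw <- data.frame(
  id = 1:3,
  weight = c("3750 g", "3800 g", "3250 g")
)

# Split at the space into two new columns; the original column is dropped
clean <- tidyr::separate_wider_delim(
  raw,
  cols = weight,
  delim = " ",
  names = c("weight_value", "weight_unit")
)

# The pieces are still text, so convert the numeric part explicitly
clean$weight_value <- as.numeric(clean$weight_value)

clean
```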

Munge the variables separately using base::factor(levels = ..., labels = ..., ordered = ...) if you need to specify factor labels and levels for each variable.
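As a small illustration of those three arguments, consider a hypothetical column of clothing sizes coded as single letters. Neither the variable nor its codes come from the penguins data.

```r
# Hypothetical column coded as short strings
size_raw <- c("S", "L", "M", "S", "M")

size <- base::factor(
  size_raw,
  levels = c("S", "M", "L"),              # the codes, in their natural order
  labels = c("Small", "Medium", "Large"), # readable names, in the same order
  ordered = TRUE                          # makes comparisons such as < meaningful
)

levels(size)
table(size)
size < "Large"
```

Without `levels`, R would sort the codes alphabetically (L, M, S), and any chart or table built from the factor would inherit that misleading order.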

8 Form Hypotheses

8.1 Question-1

State the Question or Hypothesis
(Temporarily) Choose relevant variables using dplyr::select()
Create new variables if needed with dplyr::mutate()
Filter the data set using dplyr::filter()
Reformat data if needed with tidyr::pivot_longer() or tidyr::pivot_wider()
Answer the Question with a Table, a Chart, a Test, using an appropriate Model for Statistical Inference
Use title, subtitle, legend and scales appropriately in your chart
Prefer ggformula unless you are using a chart that is not yet supported therein (e.g. geom_bump() from ggbump or plot_likert() from sjPlot)

## Set graph theme
## Safe to set this up every chunk
theme_set(new = theme_custom())

penguins %>%
  tidyr::drop_na() %>%
  gf_point(body_mass ~ flipper_len,
    colour = ~species
  ) %>%
  gf_labs(
    title = "My First Penguins Plot",
    subtitle = "Using ggformula with fonts",
    x = "Flipper Length mm", y = "Body Mass gms",
    caption = "I love penguins, and R\n But Arvind's class is another matter altogether"
  )
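The wrangling verbs listed for Question-1 can be chained into a single pipeline before the data reaches the chart. Here is a minimal sketch on the first three penguins rows from the glimpse() output. The derived variable `bill_ratio` and the 3500 cut-off are made up purely for illustration.

```r
library(dplyr)
library(tidyr)

# First three rows of three penguins columns, copied from the glimpse() output
peng <- data.frame(
  bill_len  = c(39.1, 39.5, 40.3),
  bill_dep  = c(18.7, 17.4, 18.0),
  body_mass = c(3750L, 3800L, 3250L)
)

peng_long <- peng %>%
  dplyr::mutate(bill_ratio = bill_len / bill_dep) %>% # hypothetical new variable
  dplyr::filter(body_mass > 3500) %>%                 # arbitrary cut-off
  dplyr::select(bill_len, bill_dep, bill_ratio) %>%   # keep the relevant columns
  tidyr::pivot_longer(
    cols = dplyr::everything(),
    names_to = "measure",
    values_to = "value"
  )

peng_long
```

The long format, with one row per measurement, is what you need when you want to facet or colour a chart by the kind of measurement.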

8.2 Inference-1

Description or Surprise Insight from the above graph/table/test/model, e.g.:

Gentoo penguins are the largest, followed by Chinstrap and then Adelie
There is a positive correlation between flipper length and body mass, as seen by the general upward-right movement of the data points
And so on…
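A visual impression such as the positive correlation above can be backed by a test. The sketch below assumes R 4.5 or later, where `penguins` ships with the datasets package as used earlier, and it calls `stats::cor.test()` explicitly because mosaic masks `cor.test`. Pearson's correlation is only one possible choice of test.

```r
# Assumes R >= 4.5, where `penguins` is part of the datasets package
data(penguins, package = "datasets")

# Pearson correlation test; rows with missing values are dropped by default
flipper_mass_test <- stats::cor.test(~ flipper_len + body_mass, data = penguins)

flipper_mass_test$estimate # the correlation coefficient
flipper_mass_test$conf.int # its 95% confidence interval
```

Report the estimate together with its confidence interval in the Inference text, rather than only stating that the correlation "looks" positive.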

8.3 Question-n

State the Question or Hypothesis

8.4 Inference-n

Description or Surprise Insight from the above graph/table/test/model

. . . .

9 Conclusion

Describe what the graph/table/test/models show and why it is all so interesting. What could be done next?

10 References

Nicola Rennie. (2025). https://nrennie.rbind.io/art-of-viz/
https://shancarter.github.io/ucb-dataviz-fall-2013/classes/facing-the-abyss/

Colour Palettes

Over 2500 colour palettes are available in the paletteer package. Can you find tayloRswift? wesanderson? harrypotter? timburton? You could also find/define palettes that are in line with your Company’s logo / colour schemes.

Here are the Qualitative Palettes: (searchable)

And the Quantitative/Continuous palettes: (searchable)
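The palette listings can also be queried in code. A minimal sketch, assuming the lookup tables `palettes_d_names` and `palettes_c_names` that paletteer ships as data frames, each with a `package` column:

```r
library(paletteer)

# Discrete palettes contributed by one package, here wesanderson
wes <- subset(paletteer::palettes_d_names, package == "wesanderson")
head(wes)

# Number of continuous palettes on offer
nrow(paletteer::palettes_c_names)
```

Each row names a palette, and the string to pass to the scale functions is the package name and the palette name joined by `::`.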

Use the commands:

## For Qual variable-> colour/fill:
scale_colour_paletteer_d(
  name = "Legend Name",
  palette = "package::palette",
  dynamic = FALSE # set to TRUE only for paletteer's dynamic palettes
)

## For Quant variable-> colour/fill:
scale_colour_paletteer_c(
  name = "Legend Name",
  palette = "package::palette",
  direction = 1 # use -1 to reverse the palette; `dynamic` applies only to the _d scales
)

See the paletteer gallery https://pmassicotte.github.io/paletteer_gallery/ for more information.

And also Emil Hvitfeldt’s paletteer page: <https://emilhvitfeldt.github.io/paletteer/>