The Mad Hatter’s Guide to Data Viz and Stats in R


Facing the Abyss

EDA
Workflow
Descriptive
Author

Arvind V.

Published

October 21, 2023

Modified

September 30, 2025

Abstract
A complete EDA Workflow

1 An EDA / Statistical Analysis Process

So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what?

The best charts come from understanding your data, asking good questions from it, and displaying the answers to those questions as clearly as possible. And one uses Statistical procedures to help answer those questions quantitatively using models and tests.

2 Set up your Project

  • Create a new Project in RStudio. File -> New Project -> Quarto Blog
  • Create a new Quarto document: all your Quarto documents should be in the posts/ folder. See the samples therein to get an idea.
  • Save the document with a meaningful name, e.g. EDA-Workflow-1.qmd
  • Create a new folder in the Project for your data files, e.g. data/. This can be inside the posts/ folder.
  • Store all datasets within this folder, and refer to them with relative paths, e.g. ../data/mydata.csv in any other Quarto document in the Project. (../ means “go up one level from the current folder”.)

Now edit the *.qmd file for this report to include the following sections, YAML, code chunks, and text as needed.

Note: Download this here document as a Work Template

Hit the </>Code button at upper right to copy/save this very document as a Quarto Markdown template for your work. Delete the text that you don’t need, but keep most of the Sections as they are!

3 Setting up R Packages

  1. Install packages using install.packages() in your Console.
  2. Load up your libraries in a setup chunk.
  3. Set df-print: paged so that (long) data frames print as nice paged tables and do not overrun your HTML page.
  4. Add knitr options to your YAML header, so that all your plots are rendered in high quality PNG format.

```yaml
title: "My Document"
format: html
df-print: paged
knitr:
  opts_chunk:
    dev: "ragg_png"
```

```r
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.1     ✔ stringr   1.5.2
#> ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
#> ✔ purrr     1.1.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mosaic)
#> Registered S3 method overwritten by 'mosaic':
#>   method                           from
#>   fortify.SpatialPolygonsDataFrame ggplot2
#>
#> The 'mosaic' package masks several functions from core packages in order to add
#> additional features.  The original behavior of these functions should not be affected by this.
#>
#> Attaching package: 'mosaic'
#>
#> The following object is masked from 'package:Matrix':
#>     mean
#> The following objects are masked from 'package:dplyr':
#>     count, do, tally
#> The following object is masked from 'package:purrr':
#>     cross
#> The following object is masked from 'package:ggplot2':
#>     stat
#> The following objects are masked from 'package:stats':
#>     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
#>     quantile, sd, t.test, var
#> The following objects are masked from 'package:base':
#>     max, mean, min, prod, range, sample, sum
library(ggformula)
library(ggridges)
library(skimr)
#> Attaching package: 'skimr'
#> The following object is masked from 'package:mosaic':
#>     n_missing
library(janitor)
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>     chisq.test, fisher.test
library(GGally)
library(corrplot)
#> corrplot 0.95 loaded
library(corrgram)
#> Attaching package: 'corrgram'
#> The following object is masked from 'package:GGally':
#>     baseball
#> The following object is masked from 'package:lattice':
#>     panel.fill
library(crosstable) # Summary stats tables
#> Attaching package: 'crosstable'
#> The following object is masked from 'package:purrr':
#>     compact
library(kableExtra)
#> Attaching package: 'kableExtra'
#> The following object is masked from 'package:dplyr':
#>     group_rows
library(tinytable) # Static tables
#> Attaching package: 'tinytable'
#> The following object is masked from 'package:ggplot2':
#>     theme_void
library(DT) # Interactive tables
library(paletteer) # Colour Palettes for Peasants
##
## Add other packages here as needed, e.g.:
##
## scales/ggprism; # For scales / axes formatting
## ggstats/correlation; For Correlation Analysis
## vcd/vcdExtra/ggalluvial/ggpubr for Qual Data
## sf/tmap/osmplotr/rnaturalearth;
## igraph/tidygraph/ggraph/graphlayouts;
## harrypotter/wesanderson/tayloRswift;timburton;
```

Note: Messages on Loading Packages

If you would rather not have all the messages and warnings from loading packages, you can set message=FALSE and warning=FALSE in the setup chunk above. But it is a good idea to read these messages at least once, since they often contain useful information about package versions, and any conflicts with other packages.

3.1 Use Namespace based Code

Warning

Did you notice that different packages contain identically named functions? E.g. chisq.test and fisher.test in the janitor package mask the identically named functions from {stats}. When names clash, the package loaded last does the masking, but all functions from all loaded packages remain available, provided you name the package they come from. So write dplyr::filter() / dplyr::summarize() and not just filter() or summarize(): these names exist in more than one package, and a bare call runs whichever version was loaded last. And you will avoid a lot of coding grief.

(One can also use the conflicted package to set this up, but this is simpler for beginners like us. )
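
A minimal sketch of why the prefix matters, using the two unrelated functions called filter() that the tidyverse startup message flags. It assumes only that dplyr is installed; mtcars ships with base R, and the object names smoothed and six_cyl are illustrative.

```r
# stats::filter() applies a linear filter (here a 3-point moving average) to a series
smoothed <- stats::filter(1:10, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
six_cyl <- dplyr::filter(mtcars, cyl == 6)

class(smoothed) # a time series object, not a data frame
nrow(six_cyl)   # number of six-cylinder cars in mtcars
```

Because each call names its package, the result does not depend on the order in which the packages were loaded.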

Themes and Fonts

Set up a theme for your plots. This is a good time to define your own theme, or pick an existing one, e.g. from ggprism, ggthemes, ggpubr, etc. If you have a Company logo, you can build your theme around its colours and fonts too.

```r
# Chunk options
knitr::opts_chunk$set(
  fig.width = 7,
  fig.asp = 0.618, # Golden Ratio
  # out.width = "80%",
  fig.align = "center"
)
### Ggplot Theme
### https://rpubs.com/mclaire19/ggplot2-custom-themes
### https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto

# We have locally downloaded the `Alegreya` and `Roboto Condensed` fonts.
# They are located in a folder labelled `fonts` at the project root level.
# This ensures we are GDPR-compliant, and not using Google Fonts directly.
# Let us import these local fonts into our session and use them to define our ggplot theme.
library(systemfonts)
library(showtext)
library(ggrepel)
library(marquee)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    theme_sub_axis(
      title = element_text(family = "Roboto Condensed", size = 8),
      text = element_text(family = "Roboto Condensed", size = 6)
    ) +

    theme_sub_legend(
      text = element_text(family = "Roboto Condensed", size = 6),
      title = element_text(family = "Alegreya", size = 8)
    ) +

    theme_sub_plot(
      title = element_text(family = "Alegreya", size = 14, face = "bold"),
      title.position = "plot",
      subtitle = element_text(family = "Alegreya", size = 10),
      caption = element_text(family = "Alegreya", size = 6),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
## The same defaults are applied to each text-like geom in turn.
for (geom_name in c("text", "label", "marquee", "text_repel", "label_repel")) {
  ggplot2::update_geom_defaults(geom = geom_name, new = list(
    family = "Roboto Condensed",
    face = "plain",
    size = 3.5,
    color = "#2b2b2b"
  ))
}

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)

## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")
```

4 Read Data

  • Use readr::read_csv(), or data(...) if the data is in a package. Do not use read.csv()!

```r
# `penguins` ships with the {datasets} package from R 4.5.0 onwards
data(penguins, package = "datasets")

penguins <- penguins %>% janitor::clean_names(case = "snake") # Part of the process

penguins
```
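
To see the same reading step with a file rather than a packaged dataset, here is a small sketch. The CSV contents, the temporary file and the object names are made up for illustration; in your project the path would be a relative one such as ../data/mydata.csv.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Species,Body Mass (g)", "Adelie,3750", "Gentoo,5000"),
  csv_path
)

# read_csv() returns a tibble and keeps the column names exactly as written
raw_data <- readr::read_csv(csv_path, show_col_types = FALSE)

# clean_names() converts them to snake_case, which is easier to type in code
clean_data <- janitor::clean_names(raw_data, case = "snake")
names(clean_data)
```

Cleaning the names right after reading means every later chunk can refer to columns without backticks or quoting.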

5 Examine Data

  • Use dplyr::glimpse()
  • Use skimr::skim() OR mosaic::inspect()
  • Use dplyr::summarise()
  • Use crosstable::crosstable()
  • Highlight any interesting summary stats or data imbalances, especially across groups

```r
dplyr::glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
#> $ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
#> $ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
#> $ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
#> $ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
#> $ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
#> $ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
#> $ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
skimr::skim(penguins)
```

| Data summary | |
|:--|:--|
| Name | penguins |
| Number of rows | 344 |
| Number of columns | 8 |
| Column type frequency: factor | 3 |
| Column type frequency: numeric | 5 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|:--|--:|--:|:--|--:|:--|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|:--|
| bill_len | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_dep | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_len | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |

```r
penguins %>%
  crosstable::crosstable(body_mass + bill_len + bill_dep ~ species) %>%
  as_flextable() %>%
  flextable::theme_vader() %>% # Darth Vadhyaar theme
  flextable::fontsize(size = 10, part = "all") %>%
  flextable::autofit()
```

| label | variable | species: Adelie | species: Chinstrap | species: Gentoo |
|:--|:--|:--|:--|:--|
| body_mass | Min / Max | 2850.0 / 4775.0 | 2700.0 / 4800.0 | 3950.0 / 6300.0 |
| | Med [IQR] | 3700.0 [3350.0;4000.0] | 3700.0 [3487.5;3950.0] | 5000.0 [4700.0;5500.0] |
| | Mean (std) | 3700.7 (458.6) | 3733.1 (384.3) | 5076.0 (504.1) |
| | N (NA) | 151 (1) | 68 (0) | 123 (1) |
| bill_len | Min / Max | 32.1 / 46.0 | 40.9 / 58.0 | 40.9 / 59.6 |
| | Med [IQR] | 38.8 [36.8;40.8] | 49.5 [46.3;51.1] | 47.3 [45.3;49.5] |
| | Mean (std) | 38.8 (2.7) | 48.8 (3.3) | 47.5 (3.1) |
| | N (NA) | 151 (1) | 68 (0) | 123 (1) |
| bill_dep | Min / Max | 15.5 / 21.5 | 16.4 / 20.8 | 13.1 / 17.3 |
| | Med [IQR] | 18.4 [17.5;19.0] | 18.4 [17.5;19.4] | 15.0 [14.2;15.7] |
| | Mean (std) | 18.3 (1.2) | 18.4 (1.1) | 15.0 (1.0) |
| | N (NA) | 151 (1) | 68 (0) | 123 (1) |

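Note: Insights from Data Examination of penguins
  • body_mass: Gentoo penguins look to be the heaviest, followed by Chinstrap and then Adelie, though the last two are very close (means of 3733.1 vs 3700.7).
  • bill_len: Adelie penguins have the shortest bills.
  • bill_dep: Adelie and Chinstrap bills are about equally deep (medians of 18.4), while Gentoo bills are noticeably shallower (median of 15.0).

Etc.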
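
The dplyr::summarise() bullet in this section has no code of its own in the template, so here is one possible sketch of a grouped summary that exposes group sizes and missing values. The toy data frame is invented for illustration and only mimics the penguins column names.

```r
library(dplyr) # attached for the %>% pipe; functions are still called with their namespace

# Hypothetical data: six birds, one with a missing body mass
toy <- tibble::tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  body_mass = c(3750, 3800, NA, 5000, 5200, 3700)
)

by_species <- toy %>%
  dplyr::group_by(species) %>%
  dplyr::summarise(
    n = dplyr::n(),                            # group size: reveals imbalance
    n_missing = sum(is.na(body_mass)),         # missing values per group
    mean_mass = mean(body_mass, na.rm = TRUE), # mean of the non-missing values
    .groups = "drop"
  )

by_species
```

Putting the group size and the missing-value count next to the mean makes it harder to over-read a summary that rests on very few observations.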

6 Data Dictionary and Experiment Description

  • Data Dictionary: A table containing the variable names, their interpretation, and their nature (Qual/Quant/Ord…)
  • If there are wrongly coded variables in the original data, state them in their correct form, so you can munge them in the next step
  • Declare what might be target and predictor variables, based on available information about the experiment, or a description of the data.

Note: Qualitative Variables
  • Categorical variables, e.g. species, island, sex
  • Use dplyr::count() to get counts of each category

Note: Quantitative Variables
  • Continuous variables, e.g. body_mass, flipper_len, bill_len
  • Use dplyr::summarise() to get summary statistics of each variable
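
One way to start such a data dictionary is to build it from the data frame itself and then fill in the interpretations by hand. The sketch below uses an invented toy data frame; the description texts and the Qual/Quant rule (numeric means Quant) are simplifying assumptions, not a general classification.

```r
library(dplyr) # attached for the %>% pipe

# Hypothetical data with two Qual variables and one Quant variable
toy <- tibble::tibble(
  species = factor(c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap")),
  island = factor(c("Torgersen", "Biscoe", "Dream", "Biscoe", "Biscoe", "Dream")),
  body_mass = c(3750, 3800, 3250, 5000, 5200, 3700)
)

# TRUE for numeric columns; unname() drops the column names from the result
is_quant <- unname(vapply(toy, is.numeric, logical(1)))

data_dictionary <- tibble::tibble(
  variable = names(toy),
  nature = ifelse(is_quant, "Quant", "Qual"),
  description = c("Penguin species", "Island where observed", "Body mass") # written by hand
)

# Counts of each category of a Qual variable, largest first
species_counts <- toy %>% dplyr::count(species, sort = TRUE)
```

Integer columns that are really codes (years, ratings) would be marked Quant by this rule, so the nature column still needs a human check.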

7 Data Munging

  • Drop NA data (since that is all we can do now. Sigh.)
  • Convert variables to factors as needed
  • Reformat / Rename other variables as needed
  • Clean badly formatted columns (e.g. text + numbers) using the tidyr::separate_wider_*() and tidyr::separate_longer_*() functions
  • Save the data as a modified file
  • Do not mess up the original data file
  • One Final Cleaned Data Table:
    • Format your tables with tinytable::tt() OR knitr::kable()
    • If you want an interactive table, go with DT::datatable()
```{r}
#| label: data-munging

# Template: `raw_data`, `target_variable` and `rownames` are placeholders
# for your own data frame and column names.
dataset_modified <- raw_data %>%
  janitor::clean_names(case = "snake") %>% # clean names

  tidyr::drop_na(target_variable) %>% # drop NA in target variable; call tidyr::drop_na() with no arguments for all columns

  dplyr::mutate(dplyr::across(where(is.character), as.factor)) %>%
  # Convert character variables to factors
  # Check if other integer variables are actually factors

  dplyr::relocate(where(is.factor), .after = rownames) # move factors to the right of the `rownames` column
# And so on

dataset_modified %>% tinytable::tt()

```

Munge the variables separately using base::factor(levels = ..., labels = ..., ordered = ...) if you need to specify factor labels and levels for each variable.
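
The munging steps in this section can be sketched end to end on a small invented data frame. The column names, the "value unit" text format and the output file name are assumptions made for illustration; tidyr::separate_wider_delim() needs tidyr 1.3.0 or later.

```r
library(dplyr) # attached for the %>% pipe

# Hypothetical raw data: a missing category and a text + number column
raw <- tibble::tibble(
  id = 1:4,
  size = c("small", "large", "medium", NA),
  weight = c("3.2 kg", "5.1 kg", "4.0 kg", "2.9 kg")
)

munged <- raw %>%
  tidyr::drop_na(size) %>% # drop rows with a missing size
  tidyr::separate_wider_delim(
    weight,
    delim = " ",
    names = c("weight_value", "weight_unit") # split "3.2 kg" into two columns
  ) %>%
  dplyr::mutate(
    weight_value = as.numeric(weight_value),
    # explicit levels, labels and ordering for one variable
    size = base::factor(size,
      levels = c("small", "medium", "large"),
      labels = c("S", "M", "L"),
      ordered = TRUE
    )
  )

# Save the result as a new file; the original data stays untouched
out_file <- file.path(tempdir(), "munged.csv")
readr::write_csv(munged, out_file)
```

Writing to tempdir() keeps the example harmless; in a real project you would save the modified file next to the original, never over it.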

8 Form Hypotheses

8.1 Question-1

  • State the Question or Hypothesis
  • (Temporarily) Choose relevant variables using dplyr::select()
  • Create new variables if needed with dplyr::mutate()
  • Filter the data set using dplyr::filter()
  • Reformat data if needed with tidyr::pivot_longer() or tidyr::pivot_wider()
  • Answer the Question with a Table, a Chart, a Test, using an appropriate Model for Statistical Inference
  • Use title, subtitle, legend and scales appropriately in your chart
  • Prefer ggformula unless you are using a chart that is not yet supported therein (e.g. ggbump::geom_bump() or sjPlot::plot_likert())

```r
## Set graph theme
## Safe to set this up every chunk
theme_set(new = theme_custom())

penguins %>%
  tidyr::drop_na() %>%
  gf_point(body_mass ~ flipper_len,
    colour = ~species
  ) %>%
  gf_labs(
    title = "My First Penguins Plot",
    subtitle = "Using ggformula with fonts",
    x = "Flipper Length (mm)", y = "Body Mass (g)",
    caption = "I love penguins, and R\n But Arvind's class is another matter altogether"
  )
```
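
The data-preparation bullets for a question (select, mutate, filter, pivot, then a test) might look like the sketch below. The question, the toy measurements and the derived bill_ratio variable are all invented for illustration, and the Welch t-test is just one possible choice of test.

```r
library(dplyr) # attached for the %>% pipe

# Hypothetical question: do the two species differ in bill shape (length / depth)?
toy <- tibble::tibble(
  species = rep(c("Adelie", "Gentoo"), each = 4),
  bill_len = c(38.1, 39.5, 37.8, 40.2, 46.5, 47.3, 48.1, 45.9),
  bill_dep = c(18.2, 18.9, 17.9, 18.5, 14.8, 15.2, 15.0, 14.6)
)

question_data <- toy %>%
  dplyr::select(species, bill_len, bill_dep) %>%  # keep only the relevant variables
  dplyr::mutate(bill_ratio = bill_len / bill_dep) %>% # new variable for the question
  dplyr::filter(!is.na(bill_ratio))               # drop rows that cannot be used

# Long format, e.g. for plotting both measurements in facets
long_data <- question_data %>%
  tidyr::pivot_longer(
    cols = c(bill_len, bill_dep),
    names_to = "measurement",
    values_to = "value"
  )

# stats:: prefix because mosaic masks t.test when it is loaded
ratio_test <- stats::t.test(bill_ratio ~ species, data = question_data)
ratio_test$p.value
```

Keeping question_data separate from the full dataset means each question starts from the same cleaned table rather than from the leftovers of the previous one.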

8.2 Inference-1

Description or Surprise Insight from the above graph/table/test/model, e.g.:

  • Gentoo penguins are the largest, followed by Chinstrap and then Adelie
  • There is a positive correlation between flipper length and body mass, as seen by the general upward-right movement of the data points
  • And so on…
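
A visual impression such as "positive correlation" can be backed by a number. Here is a minimal sketch with stats::cor.test(); the eight data points are invented for illustration, so the estimate says nothing about the real penguins.

```r
# Hypothetical measurements, loosely penguin-like
toy <- data.frame(
  flipper_len = c(181, 186, 195, 193, 190, 210, 215, 220),
  body_mass = c(3750, 3800, 3250, 3450, 3650, 4500, 5000, 5200)
)

# stats:: prefix because mosaic masks cor.test when it is loaded
flipper_mass_cor <- stats::cor.test(~ flipper_len + body_mass, data = toy)

flipper_mass_cor$estimate # Pearson correlation coefficient
flipper_mass_cor$conf.int # 95% confidence interval for the correlation
```

Reporting the confidence interval alongside the estimate shows how much the small sample limits what can be claimed.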

8.3 Question-n

State the Question or Hypothesis

8.4 Inference-n

Description or Surprise Insight from the above graph/table/test/model

. . . .

9 Conclusion

Describe what the graph/table/test/models show and why it is all so interesting. What could be done next?

10 References

  1. Nicola Rennie (2025). https://nrennie.rbind.io/art-of-viz/
  2. https://shancarter.github.io/ucb-dataviz-fall-2013/classes/facing-the-abyss/

Colour Palettes

Over 2500 colour palettes are available in the paletteer package. Can you find tayloRswift? wesanderson? harrypotter? timburton? You could also find/define palettes that are in line with your Company’s logo / colour schemes.



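The Qualitative palettes are listed in the data frame paletteer::palettes_d_names, and the Quantitative/Continuous palettes in paletteer::palettes_c_names. (On the live page, both appear as searchable tables.)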



Use the commands:

```r
## For Qual variable-> colour/fill:
scale_colour_paletteer_d(
  name = "Legend Name",
  palette = "package::palette",
  dynamic = FALSE # set to TRUE only for dynamic palettes
)

## For Quant variable-> colour/fill:
scale_colour_paletteer_c(
  name = "Legend Name",
  palette = "package::palette",
  direction = 1 # use -1 to reverse; this scale has no `dynamic` argument
)
```
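
To fill in the "package::palette" placeholder, you can look a palette up in the name table and ask paletteer for its colours. A small sketch, assuming paletteer is installed; picking the first wesanderson entry is arbitrary and only avoids hard-coding a palette name.

```r
# All discrete palettes that come from the wesanderson package
wes <- subset(paletteer::palettes_d_names, package == "wesanderson")

# Build the "package::palette" string that the scale functions expect
first_palette <- paste(wes$package[1], wes$palette[1], sep = "::")

# The hex colours of that palette
colours <- paletteer::paletteer_d(first_palette)
colours
```

The same string can then be passed as the palette argument of scale_colour_paletteer_d().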

See the paletteer gallery https://pmassicotte.github.io/paletteer_gallery/ for more information.

And also Emil Hvitfeldt’s paletteer page: <https://emilhvitfeldt.github.io/paletteer/>

License: CC BY-SA 2.0
