---
title: <iconify-icon icon="guidance:falling-rocks" width="1.2em" height="1.2em"></iconify-icon><iconify-icon icon="game-icons:falling" width="1.2em" height="1.2em"></iconify-icon> Facing the Abyss
date: 21/Oct/2023
date-modified: "`r Sys.Date()`"
author: Arvind V.
abstract-title: "Abstract"
abstract: "A complete EDA Workflow"
order: 200
image: preview.jpeg
image-alt: Image by rawpixel.com
code-tools: true
categories:
- EDA
- Workflow
- Descriptive
---
## An EDA / Statistical Analysis Process
So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what?
The best charts come from understanding your data, asking good questions of it, and displaying the answers to those questions as clearly as possible. Statistical procedures, in the form of models and tests, then help answer those questions quantitatively.
## {{< iconify ic baseline-folder >}} Set up your Project
- Create a new Project in RStudio. `File -> New Project -> Quarto Blog`
- Create a new Quarto document: all your Quarto documents should be in the `posts/` folder. See the samples therein to get an idea.
- Save the document with a meaningful name, e.g. `EDA-Workflow-1.qmd`
- Create a new folder in the Project for your data files, e.g. `data/`. This can be [inside the `posts/` folder]{style="background-color:yellow;"}.
- Store all datasets within this folder, and refer to them with relative paths, e.g. `../data/mydata.csv` in any other Quarto document in the Project. (`../` means "go up one level from the current folder".)
Now edit the `*.qmd` file for this report to include the following sections, YAML, code chunks, and text as needed.
::: callout-note
### Download ***this here*** document as a Work Template
Hit the `</>Code` button at upper right to copy/save this very document as a Quarto Markdown template for your work.
Delete the text that you don't need, but keep most of the Sections as they are!
:::
## {{< iconify noto-v1 package >}} Setting up R Packages
1. Install packages using `install.packages()` in your Console.
1. Load up your libraries in a `setup` chunk.
1. Set `df-print: paged` so that (long) data frames print as nice paged tables and do not overrun your HTML page.
1. Add `{knitr}` options to your YAML header, so that all your plots are rendered in high quality PNG format.
```yaml
title: "My Document"
format: html
df-print: paged
knitr:
  opts_chunk:
    dev: "ragg_png"
```
```{r}
#| label: setup
#| include: true
#| message: true
#| warning: true
#| knitr:
#| opts_chunk:
#| dev: "ragg_png"
library(tidyverse)
library(mosaic)
library(ggformula)
library(ggridges)
library(skimr)
library(janitor)
library(GGally)
library(corrplot)
library(corrgram)
library(crosstable) # Summary stats tables
library(kableExtra)
library(tinytable) # Static tables
library(DT) # Interactive tables
library(paletteer) # Colour Palettes for Peasants
##
## Add other packages here as needed, e.g.:
##
## scales/ggprism; # For scales / axes formatting
## ggstats/correlation; For Correlation Analysis
## vcd/vcdExtra/ggalluvial/ggpubr for Qual Data
## sf/tmap/osmplotr/rnaturalearth;
## igraph/tidygraph/ggraph/graphlayouts;
## harrypotter/wesanderson/tayloRswift;timburton;
```
::: callout-note
### Messages on Loading Packages
If you would rather not see all the messages and warnings from loading packages, you can set `message: false` and `warning: false` in the setup chunk above. But it is a good idea to read these messages at least once, since they often contain useful information about package versions and any conflicts with other packages.
:::
### Use Namespace based Code
::: callout-warning
Did you notice that there are similarly-named function commands from different packages? E.g. `chisq.test` and `fisher.test` are in the `{janitor}` package and they mask the identically-named commands from `{stats}`. This is because the last-loaded package does the masking, but **all functions** from all loaded packages are available!! Just remember always to **name** your code-command with the package from whence it came!
So use `dplyr::filter()` / `dplyr::summarize()` and **not** just `filter()` or `summarize()`, since identically-named commands can exist in several packages, and a bare call picks up the one from whichever package was loaded **last**. You will avoid a lot of coding grief.
(One can also use the `{conflicted}` package to set this up, but this is simpler for beginners like us. )
:::
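
As a small illustration of why the prefix matters, here is a sketch using the built-in `mtcars` data (chosen only for convenience; it is not part of this workflow). `dplyr::filter()` and `stats::filter()` share a name but do entirely different jobs:

```{r}
#| eval: false
#| echo: true
# dplyr::filter() keeps the rows of a data frame that satisfy a condition
heavy_cars <- dplyr::filter(mtcars, wt > 5)

# stats::filter() applies a linear filter to a series;
# here, a centred 3-point moving average of the numbers 1 to 10
moving_avg <- stats::filter(1:10, filter = rep(1 / 3, 3))

nrow(heavy_cars)
as.numeric(moving_avg)
```

A bare `filter()` would silently run whichever of the two was attached last, which is exactly the grief the prefix avoids.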
#### {{< iconify ic baseline-fonts >}} Themes and Fonts
Set up a theme for your plots. This is a good time to set up your own theme, or use an existing one, e.g. `{ggprism}`, `{ggthemes}`, `{ggpubr}`, etc. If your Company has a logo, you can match the theme to its colours and fonts too.
```{r}
#| label: plot-sizing-and-theming
#| code-fold: true
#| message: false
#| warning: false
# Chunk options
knitr::opts_chunk$set(
  fig.width = 7,
  fig.asp = 0.618, # Golden Ratio
  # out.width = "80%",
  fig.align = "center"
)
### Ggplot Theme
### https://rpubs.com/mclaire19/ggplot2-custom-themes
### https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto
# We have locally downloaded the `Alegreya` and `Roboto Condensed` fonts.
# They are located in a folder labelled `fonts` at the project root level.
# This ensures we are GDPR-compliant, and not using Google Fonts directly.
# Let us import these local fonts into our session and use them to define our ggplot theme.
library(systemfonts)
library(showtext)
library(ggrepel)
library(marquee)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) #set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)
sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) #enable showtext
##
## Note: the theme_sub_*() helpers need ggplot2 >= 4.0.0
theme_custom <- function() {
  theme_bw(base_size = 10) +
    theme_sub_axis(
      title = element_text(family = "Roboto Condensed", size = 8),
      text = element_text(family = "Roboto Condensed", size = 6)
    ) +
    theme_sub_legend(
      text = element_text(family = "Roboto Condensed", size = 6),
      title = element_text(family = "Alegreya", size = 8)
    ) +
    theme_sub_plot(
      title = element_text(family = "Alegreya", size = 14, face = "bold"),
      title.position = "plot",
      subtitle = element_text(family = "Alegreya", size = 10),
      caption = element_text(family = "Alegreya", size = 6),
      caption.position = "plot"
    )
}
## Use available fonts in ggplot text geoms too!
## Text-type geoms use the `fontface` aesthetic (not `face`)
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  fontface = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  fontface = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  fontface = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  fontface = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
## Set the theme
ggplot2::theme_set(new = theme_custom())
## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)
## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")
```
## {{< iconify ic baseline-input >}} Read Data
- Use `readr::read_csv()`; or `data(...)` if the data is in a package.
- **Do not** use `read.csv()`!!
```{r}
#| label: read-data
data(penguins, package = "datasets")
penguins <- penguins %>% janitor::clean_names(case = "snake") # Part of the process
penguins
```
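
For data that lives in a file rather than in a package, the `readr::read_csv()` route might look like the sketch below. The file here is a throwaway CSV written to a temporary folder purely so that the example is self-contained (an assumption for illustration); in your own project the path would be something like `../data/mydata.csv`.

```{r}
#| eval: false
#| echo: true
# Write a small demo CSV to a temporary folder (stand-in for your real data file)
csv_path <- file.path(tempdir(), "mydata.csv")
readr::write_csv(head(mtcars), csv_path)

# Read it back as a tibble; show_col_types = FALSE silences the column-type message
mydata <- readr::read_csv(csv_path, show_col_types = FALSE)

# Standardise the column names, as in the chunk above
mydata <- janitor::clean_names(mydata, case = "snake")
mydata
```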
## {{< iconify file-icons influxdata >}} Examine Data
- Use `dplyr::glimpse()`
- Use `skimr::skim()` OR `mosaic::inspect()`
- Use `dplyr::summarise()`
- Use `crosstable::crosstable()`
- Highlight any **interesting summary stats** or **data imbalances**,
especially across groups
```{r}
#| label: data-examine
dplyr::glimpse(penguins)
skimr::skim(penguins)
```
```{r}
#| label: data-examine-2
penguins %>%
  crosstable::crosstable(body_mass + bill_len + bill_dep ~ species) %>%
  flextable::as_flextable() %>%
  flextable::theme_vader() %>% # Darth Vadhyaar theme
  flextable::fontsize(size = 10, part = "all") %>%
  flextable::autofit()
```
::: callout-note
### Insights from Data Examination of `penguins`
- `body_mass`: Gentoo penguins look to be the heaviest, followed by Chinstrap and then Adelie
- `bill_len`: Adelie penguins have the shortest bills.
- `bill_dep`: No huge difference in bill depth across species
Etc.
:::
## {{< iconify streamline dictionary-language-book-solid >}} Data Dictionary and Experiment Description
- ***Data Dictionary***: A table containing the variable names, their interpretation, and their nature (Qual/Quant/Ord...)
- If there are *wrongly coded* variables in the original data, state them in their correct form, so you can munge them in the next step
- Declare what might be ***target*** and ***predictor*** variables, based on available information of the **experiment**, or a description of the data.
::: callout-note
### Qualitative Variables
- Categorical variables, e.g. `species`, `island`, `sex`
- Use `dplyr::count()` to get counts of each category
:::
::: callout-note
### Quantitative Variables
- Continuous variables, e.g. `body_mass`, `flipper_len`, `bill_len`
- Use `dplyr::summarise()` to get summary statistics of each variable
:::
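
The two callouts above can be sketched on the `penguins` data as follows. This is one possible way to do it; the summary statistics chosen (mean and standard deviation) and the result names are illustrative assumptions, not a prescription.

```{r}
#| eval: false
#| echo: true
data(penguins, package = "datasets") # needs R >= 4.5.0

# Qualitative variable: how many penguins of each species?
species_counts <- dplyr::count(penguins, species)

# Quantitative variable: mean and sd of body mass, per species
# (.by needs dplyr >= 1.1.0; na.rm = TRUE skips missing measurements)
mass_summary <- dplyr::summarise(
  penguins,
  mean_mass = mean(body_mass, na.rm = TRUE),
  sd_mass = stats::sd(body_mass, na.rm = TRUE),
  .by = species
)

species_counts
mass_summary
```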
## {{< iconify carbon clean >}} Data Munging
- Drop NA data (since that is all we can do now. Sigh.)
- Convert variables to factors as needed
- Reformat / Rename other variables as needed
- Clean badly formatted columns (e.g. text + numbers) using the `tidyr::separate_*_*()` family, e.g. `tidyr::separate_wider_delim()`
- **Save the data as a modified file**
- **Do not mess up the original data file**
- One Final Cleaned Data Table:
    - Format your tables with `tinytable::tt()` OR `knitr::kable()`
    - If you want an interactive table, go with `DT::datatable()`
```{r}
#| label: data-munging
## `body_mass` stands in for your target variable; adapt the names to your own data
dataset_modified <- penguins %>%
  janitor::clean_names(case = "snake") %>% # clean names
  dplyr::filter(!is.na(body_mass)) %>% # drop NA in target variable; use tidyr::drop_na() for all columns
  dplyr::mutate(dplyr::across(where(is.character), as.factor)) %>%
  # Convert character variables to factors
  # Check if other integer variables are actually factors
  dplyr::relocate(where(is.factor)) # move factors to the front of the table
# And so on
dataset_modified %>% tinytable::tt()
```
Munge the variables **separately** using `base::factor(levels = ..., labels = ..., ordered = ...)` if you need to specify factor `labels` and `levels` for each variable.
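
A minimal sketch of such a `base::factor()` call is shown below. The `satisfaction_raw` vector and its Low/Medium/High coding are made up for illustration; substitute the column and coding scheme from your own data dictionary.

```{r}
#| eval: false
#| echo: true
# Hypothetical survey responses, stored as integer codes 1 to 3
satisfaction_raw <- c(2, 3, 1, 3, 2)

satisfaction <- base::factor(
  satisfaction_raw,
  levels = c(1, 2, 3),                  # the codes as they appear in the data
  labels = c("Low", "Medium", "High"),  # what each code means, in the same order
  ordered = TRUE                        # the categories have a natural order
)

levels(satisfaction)
satisfaction[1] < satisfaction[2] # ordered factors support comparisons
```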
## {{< iconify material-symbols lab-research >}} Form Hypotheses
### Question-1
- State the Question or Hypothesis
- (Temporarily) Choose relevant variables using `dplyr::select()`
- Create new variables if needed with `dplyr::mutate()`
- Filter the data set using `dplyr::filter()`
- Reformat data if needed with `tidyr::pivot_longer()` or `tidyr::pivot_wider()`
- Answer the Question with a Table, a Chart, a Test, using an appropriate Model for Statistical Inference
- Use `title`, `subtitle`, `legend` and `scales` appropriately in your chart
- Prefer `ggformula` unless you are using a chart that is not yet supported therein (e.g. `ggbump::geom_bump()` or `sjPlot::plot_likert()`)
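
The wrangling steps in the list above can be chained into a single pipeline. The sketch below does this for the `penguins` bill measurements; the derived `bill_ratio` column and the `measurement` / `value_mm` names are hypothetical choices made for illustration.

```{r}
#| eval: false
#| echo: true
library(dplyr) # also provides the %>% pipe
data(penguins, package = "datasets") # needs R >= 4.5.0

bills_long <- penguins %>%
  dplyr::select(species, bill_len, bill_dep) %>%        # choose relevant variables
  dplyr::filter(!is.na(bill_len), !is.na(bill_dep)) %>% # keep complete measurements
  dplyr::mutate(bill_ratio = bill_len / bill_dep) %>%   # create a new variable
  tidyr::pivot_longer(
    cols = c(bill_len, bill_dep),
    names_to = "measurement",
    values_to = "value_mm"
  ) # one row per penguin per measurement

bills_long
```

The long format is handy when you want to facet or colour a chart by `measurement`.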
```{r}
#| label: figure-1
#| fig-showtext: true
#| fig-format: png
## Set graph theme
## Safe to set this up every chunk
theme_set(new = theme_custom())
penguins %>%
  tidyr::drop_na() %>%
  gf_point(body_mass ~ flipper_len,
           colour = ~ species) %>%
  gf_labs(title = "My First Penguins Plot",
          subtitle = "Using ggformula with fonts",
          x = "Flipper Length (mm)", y = "Body Mass (g)",
          caption = "I love penguins, and R\n But Arvind's class is another matter altogether")
```
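
To back a chart like this with a test, one option (a sketch, and by no means the only appropriate choice) is a Pearson correlation test via `stats::cor.test()`, written with an explicit namespace prefix in keeping with the advice earlier in this document. The result name `mass_flipper_test` is just an illustrative label.

```{r}
#| eval: false
#| echo: true
data(penguins, package = "datasets") # needs R >= 4.5.0

# Pearson correlation between body mass and flipper length;
# rows with missing values are dropped before the test is computed
mass_flipper_test <- stats::cor.test(~ body_mass + flipper_len, data = penguins)

mass_flipper_test$estimate # correlation coefficient
mass_flipper_test$conf.int # 95% confidence interval
mass_flipper_test$p.value
```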
### Inference-1
Description or Surprise Insight from the above graph/table/test/model, e.g.:
- Gentoo penguins are the largest, followed by Chinstrap and then Adelie
- There is a positive correlation between flipper length and body mass, as seen by the general upward-right movement of the data points
- And so on...
.
.
.
### Question-n
State the Question or Hypothesis
### Inference-n
Description or Surprise Insight from the above graph/table/test/model
.
.
.
.
## {{< iconify fluent-mdl2 decision-solid >}} Conclusion
Describe what the graph/table/test/models show and why it is all so interesting. What could be done next?
## {{< iconify ooui references-rtl >}} References
1. Nicola Rennie. (2025). <https://nrennie.rbind.io/art-of-viz/>
1. <https://shancarter.github.io/ucb-dataviz-fall-2013/classes/facing-the-abyss/>
##### Colour Palettes
Over 2500 colour palettes are available in the `paletteer` package. Can you find `tayloRswift`? `wesanderson`? `harrypotter`? `timburton`? You could also find/define palettes that are in line with your Company's logo / colour schemes.
<br><br>
Here are the Qualitative Palettes: (searchable)
<br><br>
```{r}
#| echo: false
library(paletteer)
library(wesanderson)
library(gameofthrones)
library(reactable)
palettes_d_names %>% reactable::reactable(data = ., filterable = TRUE, minRows = 10)
```
<br><br>
And the Quantitative/Continuous palettes: (searchable)
<br><br>
```{r}
#| echo: false
palettes_c_names %>% reactable::reactable(data = ., filterable = TRUE, minRows = 10)
```
<br><br>
Use the commands:
```{r}
#| eval: false
#| echo: true
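## For a Qual variable -> colour (use scale_fill_paletteer_d() for fill):
scale_colour_paletteer_d(name = "Legend Name",
                         palette = "package::palette",
                         dynamic = FALSE) # set TRUE only for dynamic palettes
## For a Quant variable -> colour (use scale_fill_paletteer_c() for fill):
scale_colour_paletteer_c(name = "Legend Name",
                         palette = "package::palette",
                         direction = 1) # -1 reverses the palette; there is no `dynamic` argument here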
```
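
As a concrete sketch, here is a discrete palette applied to a `penguins` scatter plot. The palette `wesanderson::Darjeeling1` is an arbitrary pick from the searchable table; any `package::palette` name listed there with at least three colours should work the same way.

```{r}
#| eval: false
#| echo: true
library(ggplot2)
data(penguins, package = "datasets") # needs R >= 4.5.0

palette_plot <- ggplot(penguins,
                       aes(x = flipper_len, y = body_mass, colour = species)) +
  geom_point(na.rm = TRUE) + # quietly skip rows with missing measurements
  paletteer::scale_colour_paletteer_d(name = "Species",
                                      palette = "wesanderson::Darjeeling1")
palette_plot
```

To preview the colours before plotting, `paletteer::paletteer_d("wesanderson::Darjeeling1")` returns them as a vector of hex codes.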
See the `paletteer` gallery <https://pmassicotte.github.io/paletteer_gallery/> for more information.
And also Emil Hvitfeldt's `paletteer` page: <https://emilhvitfeldt.github.io/paletteer/>