Graphs
Charts and How they are generated from Data
“He is one of those who don’t want millions, but an answer to their questions.”
— Alyosha, in The Brothers Karamazov
1 Setting up R Packages
library(tidyverse)
library(mosaic) # Our all-in-one package
library(skimr) # Looking at data
library(ggformula) # Our plotting package
library(visdat) # Mapping missing data
library(naniar) # Missing data visualization and munging
library(janitor) # Clean the data
library(tinytable) # Printing Tables for our data
library(DT) # Interactive Tables for our data
##
# devtools::install_github("rpruim/Lock5withR")
library(Lock5withR)
library(Lock5Data) # Some neat little datasets from a lovely textbook
Plot Fonts and Theme
library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
family = "Alegreya",
regular = "../../../../../../fonts/Alegreya-Regular.ttf",
bold = "../../../../../../fonts/Alegreya-Bold.ttf",
italic = "../../../../../../fonts/Alegreya-Italic.ttf",
bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)
sysfonts::font_add(
family = "Roboto Condensed",
regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
theme_bw(base_size = 10) +
theme_sub_axis(
title = element_text(
family = "Roboto Condensed",
size = 8
),
text = element_text(
family = "Roboto Condensed",
size = 6
)
) +
theme_sub_legend(
text = element_text(
family = "Roboto Condensed",
size = 6
),
title = element_text(
family = "Alegreya",
size = 8
)
) +
theme_sub_plot(
title = element_text(
family = "Alegreya",
size = 14, face = "bold"
),
title.position = "plot",
subtitle = element_text(
family = "Alegreya",
size = 10
),
caption = element_text(
family = "Alegreya",
size = 6
),
caption.position = "plot"
)
}
## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
family = "Roboto Condensed",
fontface = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
family = "Roboto Condensed",
fontface = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "marquee", new = list(
family = "Roboto Condensed",
face = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
family = "Roboto Condensed",
fontface = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
family = "Roboto Condensed",
fontface = "plain",
size = 3.5,
color = "#2b2b2b"
))
## Set the theme
ggplot2::theme_set(new = theme_custom())
## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)
## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")
2 Why Visualize?
2.1 An Iconic Presentation
2.2 Some Reasons
- We can digest information more easily when it is pictorial.
- Our working memory is both short-term and limited in capacity. A picture abstracts away the details and presents an overall summary, an insight, or a story that is easy to retain and recall.
- Data Viz includes shapes that carry strong cultural memories and impressions for us. These cultural memories help us to use data viz in a universal way to appeal to a wide variety of audiences. (Do humans have a gene for geometry?)
2.3 Some Pictures
- Visualization helps sift facts from mere statements.
- Visuals are a good starting point to make hypotheses of what may be happening in the situation represented by the data.
3 Why Analyze?
3.1 Analysis
- Visualizations may not tell us the true magnitude or significance of things.
- We need analytic methods or statistics to assure ourselves that something is happening.
- These methods reduce human bias and let us speak with the assurance that our problem deserves.
- Analysis uses numbers, or metrics, that allow us to crystallize our ambiguous words/guesses.
- These metrics are calculable from our data, of course, but are not directly visible, despite often being intuitive.
- Using these metrics, we need to become, paradoxically enough, sure of our uncertainty.
So we need both visuals and analytics. And as we will see, we will not be content with that: we will visualize our analytics, and analyze our visualizations!
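As a minimal sketch of what it means to be "sure of our uncertainty", the code below computes a metric (a difference in group means) together with a confidence interval for it. It uses only base R and the built-in mtcars dataset, which is an assumption made purely for illustration; it is not a dataset used elsewhere in this module.

```r
# Illustrative only: mtcars ships with base R (datasets package)
data(mtcars)

# A metric: mean fuel efficiency (mpg) for automatic (am = 0)
# and manual (am = 1) cars
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
group_means

# Welch two-sample t-test on the same two groups
result <- t.test(mpg ~ am, data = mtcars)

# The confidence interval states how uncertain we are about
# the difference in means; the p-value summarises the evidence
result$conf.int
result$p.value
```

A chart of mpg by am would suggest a difference; the interval and p-value say how much trust that impression deserves.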
3.2 What is Tidy Data?
Let us recall first what we meant by tidy data:
3.3 Tidy Data Principles
- Each variable is a column.
- Each column contains one kind of data.
- Each observation or case is a row.
- Each observation contains one value for each variable.
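The principles above can be sketched with a small, invented table. The data frame and its column names (city, year, rainfall) are hypothetical and exist only for this illustration; pivot_longer() is from the tidyr package.

```r
library(tidyr)
library(tibble)

# Hypothetical "untidy" table: the variable `year` is spread across
# column names instead of living in a column of its own
untidy <- tibble(
  city = c("Pune", "Kochi"),
  `2021` = c(10, 20),
  `2022` = c(12, 25)
)

# Tidy version: one column per variable, one row per observation
tidy <- pivot_longer(
  untidy,
  cols = -city,            # every column except `city` holds measurements
  names_to = "year",       # old column names become values of `year`
  values_to = "rainfall"   # cell contents become values of `rainfall`
)
tidy
```

Each row of the tidy table is now one observation (a city in a year), and each column holds one kind of data.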
4 What is a Data Visualization?
4.1 Data Viz = Data + Geometry
- How many geometric things do we know?
- Shapes? Lines? Axes? Curves? Angles? Patterns? Textures? Colours? Sizes? Positions? Lengths? Heights? Breadths? Radii?
- All these are geometric aspects or aesthetics, each with a unique property.
- Some “geometric things” which we might consider are shown in the figure below.
4.2 Mapping
- How can we manipulate these geometric aesthetics, perhaps like Kandinsky?
- The aesthetic has a property, an attribute, which we can manipulate in accordance with a data variable!
- This act of “mapping” a geometric thing to a variable and modifying its essential property is called Data Visualization.
4.3 Mapping Examples
- The length or height of a bar can be made proportional to the age or income of a person.
- The colour of points can be mapped to gender, with a unique colour for each gender.
- Position along an X-axis can vary in accordance with a height variable, and
- Position along the Y-axis can vary with a bodyWeight variable.
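The mappings listed above can be sketched in a few lines with ggformula. The people data frame is invented for this illustration; only the variable names height, bodyWeight and gender come from the text.

```r
library(ggformula)

# Hypothetical data, invented purely for illustration
people <- data.frame(
  height = c(150, 160, 165, 172, 180, 185),
  bodyWeight = c(50, 58, 62, 70, 82, 90),
  gender = c("F", "F", "M", "F", "M", "M")
)

# Position along the x-axis is mapped to height,
# position along the y-axis is mapped to bodyWeight,
# and the colour of each point is mapped to gender
p_mapping <- gf_point(bodyWeight ~ height,
  colour = ~gender,
  size = 3, # a fixed size, not mapped to any variable
  data = people
)
p_mapping
```

Note the difference between colour = ~gender (a mapping, written with a tilde) and size = 3 (a constant setting).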
4.4 Using Multiple Geometries
- A chart may use more than one aesthetic: position, shape, colour, height and angle, pattern or texture, to name several.
- Usually, each aesthetic is mapped to just one variable to ensure there is no cognitive error.
- There is of course a choice and you should be able to map any kind of variable to any geometric aspect/aesthetic that may be available.
4.5 A Natural Mapping
- Note that there is also a “natural” mapping between an aesthetic and the kind of variable (Quantitative or Qualitative).
- For instance, shape is rarely mapped to a Quantitative variable; the nature of variation between the Quantitative variable and the shape aesthetic is not similar (i.e. not continuous).
- Bad choices may lead to bad, or worse, misleading charts!
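One way to see the "natural" mapping in practice is to try an unnatural one. The sketch below uses ggplot2 and the built-in mtcars dataset (both choices are assumptions for illustration). Mapping shape to a Qualitative variable builds normally, whereas mapping shape to a Quantitative variable is expected to fail when the plot is built.

```r
library(ggplot2)

# Qualitative variable -> shape: a natural mapping
ok <- ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(cyl))) +
  geom_point()
built_ok <- ggplot_build(ok)

# Quantitative variable -> shape: shapes do not vary continuously,
# so building this plot should raise an error, which we capture
bad <- ggplot(mtcars, aes(x = wt, y = mpg, shape = hp)) +
  geom_point()
outcome <- tryCatch(
  ggplot_build(bad),
  error = function(e) conditionMessage(e)
)

# `outcome` holds the error message rather than a built plot
outcome
```

Wrapping the Quantitative variable in factor(), or choosing a continuous aesthetic such as size or colour instead, would be the usual remedies.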
4.6 A Data Visualization Example
set.seed(1947)
diamonds %>%
  # Take a small random sample so that individual points stay visible
  slice_sample(n = 150, weight_by = cut) %>%
  # Map position to carat and price; map colour and shape to cut
  gf_point(price ~ carat,
    colour = ~cut,
    shape = ~cut,
    size = 2, data = .
  ) %>%
  gf_labs(
    title = "Plot Title = DIAMONDS ARE FOREVER",
    subtitle = "Plot Subtitle = AND A GIRL'S BEST FRIEND",
    caption = "Plot Caption = From the diamonds dataset",
    x = "x-Axis Title = CARAT",
    y = "y-Axis Title = PRICE"
  ) %>%
  # Use same name for scales to merge legends
  gf_refine(
    scale_color_brewer(
      name = "Legend = DIAMOND QUALITY",
      palette = "Set1"
    ),
    scale_shape_manual(
      name = "Legend = DIAMOND QUALITY",
      values = c(15:21)
    )
  ) %>%
  # Annotations: a text label, a curved arrow, and a shaded rectangle
  gf_annotate("text",
    x = 1.0, y = 16000,
    label = "These DIAMONDS are\n Super Affordable!!",
    fontface = "bold",
    size = 2
  ) %>%
  gf_annotate("curve",
    x = 0.9,
    y = 14500,
    yend = 8000,
    xend = 0.95,
    linewidth = 0.5,
    curvature = 0.5,
    arrow = arrow(length = unit(0.25, "cm"))
  ) %>%
  gf_annotate(
    "rect",
    xmin = 1,
    xmax = 1.25,
    ymin = 2250,
    ymax = 10000,
    alpha = 0.5,
    fill = "grey80",
    col = "black"
  )
4.7 What were the Components?
- In the above chart, it is pretty clear what kind of variable is plotted on the x-axis and the y-axis.
- The dominant geometry is a point, whose position is determined by the x and y variables.
- The shape of the point is determined by the cut variable.
- What about colour? Could this be considered as another axis in the chart?
- There are also other aspects that you can choose (not explicitly shown here) such as the plot theme (colours, fonts, backgrounds etc.), which may not be mapped to data, but are nonetheless choices to be made.
- We will get acquainted with this aspect as we build charts.
4.8 Transformations
- As we will see, Data Variables may be transformed before being mapped to some geometric aesthetic.
- e.g. we may perform counts with a Qual variable that contains only the entries {S, M, L, XL}.
- We may also transform the axes (make them logarithmic, or even polar) to create precisely the shape-meaning we wish.
- This allows us considerable flexibility in making charts!
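The transformations described above can be sketched as follows. The shirts data frame is hypothetical and exists only to supply a Qual variable with the entries S, M, L and XL; the diamonds dataset comes with ggplot2.

```r
library(dplyr)
library(ggplot2)

# Hypothetical Qual variable holding only the entries S, M, L, XL
shirts <- data.frame(
  size = factor(
    c("S", "M", "M", "L", "L", "L", "XL"),
    levels = c("S", "M", "L", "XL")
  )
)

# Transforming a variable: count the rows at each level before plotting
size_counts <- dplyr::count(shirts, size)
size_counts

# Transforming the axes: logarithmic scales on both x and y
p_log <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  scale_x_log10() +
  scale_y_log10()

# Transforming the coordinate system: the same bars, drawn in polar form
p_polar <- ggplot(size_counts, aes(x = size, y = n)) +
  geom_col() +
  coord_polar()
```

Printing p_log or p_polar draws the chart; the underlying data are unchanged in both cases, only their geometric rendering differs.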
4.9 Facets
- Finally, if the graph is too busy, with lots of colours and shapes, then we can split the graph into many small multiples or facets, each showing a subset of the data.
- This is called faceting and is a powerful way to reduce cognitive load on the viewer.
set.seed(1947)
diamonds %>%
  slice_sample(n = 150, weight_by = cut) %>%
  # The `| clarity` part of the formula creates one facet per clarity level
  gf_point(price ~ carat | clarity,
    colour = ~cut,
    shape = ~cut,
    size = 2, data = .
  ) %>%
  gf_labs(
    title = "Plot Title = DIAMONDS ARE FOREVER",
    subtitle = "Plot Subtitle = AND A GIRL'S BEST FRIEND",
    caption = "Plot Caption = From the diamonds dataset",
    x = "x-Axis Title = CARAT",
    y = "y-Axis Title = PRICE"
  ) %>%
  # Use same name for scales to merge legends
  gf_refine(
    scale_color_brewer(
      name = "Legend = DIAMOND QUALITY",
      palette = "Set1"
    ),
    scale_shape_manual(
      name = "Legend = DIAMOND QUALITY",
      values = c(15:21)
    )
  )
5 Basic Types of Charts
5.1 Mapping Variables to Aesthetics
- We can therefore think of simple visualizations as combinations of aesthetics, mapped to combinations of variables.
- It should be possible to use the many shapes we know, or can conceive of, and marry them to data to create a brand new visualization method that advances both understanding and retention! You should try!!
5.2 Mappings and Charts: A Catalogue
Variable #1 | Variable #2 | Chart Names | Chart Shape |
---|---|---|---|
Quant | None | Histogram and Density | |
Qual | None | Bar Chart | |
Quant | Quant | Scatter Plot, Line Chart, Bubble Plot, Area Chart | |
Quant | Qual | Pie Chart, Donut Chart, Column Chart, Box-Whisker Plot, Radar Chart, Bump Chart, Tree Diagram | |
Qual | Qual | Stacked Bar Chart, Mosaic Chart, Sankey, Chord Diagram, Network Diagram | |
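As a rough companion to the catalogue, here is one chart from each row, sketched with ggformula and the diamonds dataset from ggplot2. The choice of variables is an assumption made for illustration, and each row of the table admits several other charts.

```r
library(ggformula)

# diamonds ships with ggplot2, which ggformula builds on
diamonds <- ggplot2::diamonds

# Quant alone: histogram
p_hist <- gf_histogram(~price, data = diamonds, bins = 30)

# Qual alone: bar chart of counts
p_bar <- gf_bar(~cut, data = diamonds)

# Quant vs Quant: scatter plot
p_scatter <- gf_point(price ~ carat, data = diamonds, alpha = 0.1)

# Quant vs Qual: box-whisker plot
p_box <- gf_boxplot(price ~ cut, data = diamonds)

# Qual vs Qual: stacked bar chart
p_stack <- gf_bar(~cut, fill = ~clarity, data = diamonds)
```

Printing any of these objects (for example, p_box) renders the corresponding chart.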
6 Conclusion
6.1 Data Science Workflow
6.2 Workflow Description
So there we have it:
- Data: We generate data by experiment, or obtain readily available data. We import and clean the data.
- Variables: Questions lead us to identify Types of Variables (Quant and Qual)
- Transform: Sometimes we may need to transform the data (long to wide, summarize, create new variables…)
- Explore: Further Questions lead us to infer relationships between variables, the relative size of things, which we describe using Data Visualizations
- Report: What we find may be of interest or, best of all, outright surprising! It is finally communicated with charts and descriptions in a research report.
6.3 Grammar of Data Visualization
You might think of all these Questions, Answers, and Mappings as forming a grammar, a language in itself.
And indeed, in R we use a philosophy called the Grammar of Graphics! We will use this grammar in the R graphics packages that we will encounter when we make Graphs next.
Other parts of the Workflow (Transformation, Faceting, Analysis and Modelling) also fall within the grammar, as we shall see.
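To make the idea of a grammar concrete, here is a minimal sketch in ggplot2, where each line contributes one "part of speech". The mpg dataset (bundled with ggplot2) and the particular variables chosen are assumptions for illustration only.

```r
library(ggplot2)

p_grammar <- ggplot(
  data = mpg,                                      # Data
  mapping = aes(x = displ, y = hwy, colour = drv)  # Mapping of variables to aesthetics
) +
  geom_point() +                                   # Geometry
  scale_y_log10() +                                # Transformation of an axis
  facet_wrap(vars(class)) +                        # Facets (small multiples)
  labs(
    title = "Engine size vs highway mileage",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon (log scale)"
  ) +
  theme_bw()                                       # Theme: choices not mapped to data
p_grammar
```

Removing or swapping any one line changes one aspect of the chart and leaves the rest intact, which is what makes the grammar composable.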
7 AI Generated Summary and Podcast
7.1 Summary
This is a tutorial on data visualization using the R programming language. It introduces concepts such as data types, variables, and visualization techniques. The tutorial utilizes metaphors to explain these concepts, emphasizing the use of geometric aesthetics to represent data. It also highlights the importance of both visual and analytic approaches in understanding data. The tutorial then demonstrates basic chart types, including histograms, scatterplots, and bar charts, and discusses the “Grammar of Graphics” philosophy that guides data visualization in R. The text concludes with a workflow diagram for data science, emphasizing the iterative process of data import, cleaning, transformation, visualization, hypothesis generation, analysis, and communication.
Citation
@online{v.2021,
author = {V., Arvind},
title = {Graphs},
date = {2021-11-01},
url = {https://madhatterguide.netlify.app/content/courses/Analytics/10-Descriptive/Modules/09-Graphs/},
langid = {en}
}