The Mad Hatter’s Guide to Data Viz and Stats in R

Data

Where does Data come from, and what does it look like?

Scientific Inquiry
Experiments
Observations
Nature of Data
Experience
Measurement
Author

Arvind V.

Published

November 1, 2021

Modified

September 30, 2025

Experiments and Observations


“Difficulties strengthen the mind, as labor does the body.”

— Seneca

1 Setting up R Packages

library(tidyverse) # Data processing with tidy principles
library(mosaic) # Our go-to package for almost everything

# devtools::install_github("rpruim/Lock5withR")
library(Lock5withR)
library(Lock5Data) # Some neat little datasets from a lovely textbook
library(kableExtra)

Plot Fonts and Theme

library(systemfonts)
library(showtext)
library(ggrepel)
library(marquee)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  theme_bw(base_size = 10) +

    theme_sub_axis(
      title = element_text(
        family = "Roboto Condensed",
        size = 8
      ),
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      )
    ) +

    theme_sub_legend(
      text = element_text(
        family = "Roboto Condensed",
        size = 6
      ),
      title = element_text(
        family = "Alegreya",
        size = 8
      )
    ) +

    theme_sub_plot(
      title = element_text(
        family = "Alegreya",
        size = 14, face = "bold"
      ),
      title.position = "plot",
      subtitle = element_text(
        family = "Alegreya",
        size = 10
      ),
      caption = element_text(
        family = "Alegreya",
        size = 6
      ),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

ggplot2::update_geom_defaults(geom = "marquee", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)


## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")

1.1 Using web-R

This tutorial uses web-R, which lets you run all the code within your browser, on any device. Most code chunks are arranged in tabs (as in an old-fashioned library), with the same code in each. The tab in front holds regular R code that works when copy-pasted into your RStudio session. The tab “behind” holds the web-R code, which runs directly in your browser and can be modified as well. The regular R code also gives you an original to go back to and compare against, once you have made several modifications on the web-R tab.

1.2 Keyboard Shortcuts

  • Run selected code using either:
    • macOS: ⌘ + ↩︎/Return
    • Windows/Linux: Ctrl + ↩︎/Enter
  • Run the entire code by clicking the “Run code” button or pressing Shift+↩︎.
Important: Click on any Picture to Zoom

All embedded figures are displayed full-screen when clicked.

2 Where does Data come from?

We first need a basic understanding of how the scientific enterprise produces data. Let us look at the slides. (They are also embedded below!)


3 What are Data Types?

 

Important: Tidy Data

Each variable is a column; a column contains one kind of data. Each observation or case is a row.
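
As a minimal sketch of this rule, the base R data frame below holds made-up values. The names `tidy_pets`, `name`, `species`, and `weight_kg` are illustrative assumptions and do not come from any dataset used in this module.

```r
# Hypothetical data: each column is one variable, each row is one observation
tidy_pets <- data.frame(
  name      = c("Alice", "Dinah", "Bill"),   # Qualitative: who
  species   = c("human", "cat", "lizard"),   # Qualitative: what kind
  weight_kg = c(32.5, 4.1, 0.2)              # Quantitative: how heavy
)

nrow(tidy_pets)            # number of observations (rows)
ncol(tidy_pets)            # number of variables (columns)
sapply(tidy_pets, class)   # each column holds exactly one kind of data
```

Because every column holds one kind of data, a single call such as `sapply(tidy_pets, class)` can describe the whole table.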

4 How do we Spot Data Variable Types?

By asking questions! Shown below is a table of different kinds of questions you could use to query a dataset. The variable or variables that “answer” the question would be in the category indicated by the question.

4.1 Variables and Operations

| No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
|---|---|---|---|---|---|
| 1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios/Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation |
| 2 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities with Scale. Differences are meaningful, but not products or ratios. | Quantitative/Interval | pH, SAT score (200-800), Credit score (300-850), Year of Starting College | Mean, Standard Deviation |
| 3 | How, What Kind, What Sort | A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best...) | Qualitative/Ordinal | Socioeconomic status (Low income, Middle income, High income), Education level (High School, BS, MS, PhD), Satisfaction rating (Very Much Dislike, Dislike, Neutral, Like, Very Much Like) | Median, Percentile |
| 4 | What, Who, Where, Whom, Which | Name, Place, Animal, Thing | Qualitative/Nominal | Name | Count no. of cases, Mode |
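
A rough first pass at spotting variable types can also be made in R itself. The sketch below uses a hypothetical data frame (`survey` and all its column names are assumptions for illustration) and classifies each column from its storage type. Treat the result only as a hint: a number-coded category would be misread as Quantitative, so the questions in the table still have the final word.

```r
# Hypothetical survey rows; names and values are illustrative only
survey <- data.frame(
  height_cm    = c(160.2, 171.5, 182.0, 168.3),             # How much? -> Ratio
  start_year   = c(2019, 2020, 2020, 2021),                  # When? -> Interval
  satisfaction = factor(c("Like", "Neutral", "Like", "Dislike"),
                        levels = c("Dislike", "Neutral", "Like"),
                        ordered = TRUE),                      # What sort? -> Ordinal
  city         = factor(c("Pune", "Delhi", "Pune", "Goa"))   # Where? -> Nominal
)

str(survey)  # shows how R stores each column

# Label each column from its storage type (a hint, not a verdict)
kinds <- ifelse(sapply(survey, is.numeric), "Quantitative", "Qualitative")
kinds
```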

4.2 Variables and Hierarchy

As you go from Qualitative to Quantitative data types in the table, I hope you can detect a movement from fuzzy groups/categories to more and more crystallized numbers.

Type of Variables


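Each variable/scale can also be subjected to the operations of every scale below it in the hierarchy: a Ratio variable supports everything that an Interval, Ordinal, or Nominal variable does. In the words of S.S. Stevens: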

the basic operations needed to create each type of scale is cumulative: to an operation listed opposite a particular scale must be added all those operations preceding it.
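
The cumulative idea can be sketched in base R with four made-up vectors, one per scale. All object names and values below are hypothetical. Note that `quantile()` with `type = 1` is used for the ordinal median, since that type is documented to accept ordered factors.

```r
# Nominal: we can count cases and find the mode
city <- factor(c("Pune", "Delhi", "Pune", "Goa", "Pune"))
city_counts <- table(city)
city_mode <- names(city_counts)[which.max(city_counts)]

# Ordinal: counts still work, and order-based summaries become available
rating <- factor(c("Like", "Neutral", "Like", "Dislike", "Neutral"),
                 levels = c("Dislike", "Neutral", "Like"), ordered = TRUE)
table(rating)
rating_median <- quantile(rating, probs = 0.5, type = 1)

# Interval: all of the above, plus mean and standard deviation
start_year <- c(2019, 2020, 2020, 2021, 2022)
year_mean <- mean(start_year)
year_sd <- sd(start_year)

# Ratio: all of the above, plus meaningful ratios
height_cm <- c(150, 160, 165, 180, 200)
height_ratio <- max(height_cm) / min(height_cm)
```

Going the other way does not work: asking for the mean of `city` is not meaningful, because a Nominal variable has neither order nor scale.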

5 Some Examples of Data Variables

5.1 Example 1: AllCountries

head(AllCountries, 5) %>% arrange(desc(Internet))
Note: Questions

Q1. How many people in Andorra have internet access?
A1. This leads to the Internet variable, which is a Quantitative variable, a proportion (see footnote 1). The answer is 70.5%.
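
To see how a Quantitative proportion can arise from Qualitative answers, here is a small sketch with invented Yes/No survey responses. The vector `has_internet` and its values are hypothetical and are not taken from AllCountries.

```r
# Hypothetical answers to "Do you have internet access?"
has_internet <- c("Yes", "No", "Yes", "Yes", "No",
                  "Yes", "Yes", "Yes", "No", "Yes")

# Each individual answer is Qualitative (Nominal)...
table(has_internet)

# ...but the share of "Yes" answers is a Quantitative summary
internet_pct <- 100 * mean(has_internet == "Yes")
internet_pct
```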

5.2 Example 2: StudentSurvey

head(StudentSurvey, 5)
Note: Questions

Q1. What kind of students are these?
A1. The variables Gender and Year both answer this Question, and both are Qualitative/Categorical variables, of course.
Q2. What is their status in their respective families?
A2. Hmm…they are either first-born, or second-born, or third…etc. While this is recorded as a number, it is still a Qualitative variable (see footnote 2)! Think! Can you do math operations with BirthOrder? Would a mean BirthOrder make sense?
Q3. How big are the families?
A3. Clearly, the variable that answers is Siblings, and since the question is synonymous with “how many”, this is a Quantitative variable.
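
The BirthOrder point can be made concrete with a small sketch. The data frame `students` below is a hypothetical stand-in with invented values. It borrows only the column names BirthOrder and Siblings, and does not load the real StudentSurvey data.

```r
# Hypothetical stand-in for a few StudentSurvey-style rows
students <- data.frame(
  BirthOrder = c(1, 2, 1, 3, 2, 1),
  Siblings   = c(1, 2, 0, 4, 3, 1)
)

# Stored as numbers, so R will compute a mean whether or not it is meaningful
mean(students$BirthOrder)

# Declaring an ordered factor makes the Qualitative nature explicit
students$BirthOrder <- factor(students$BirthOrder, ordered = TRUE)

table(students$BirthOrder)        # counting cases is meaningful
levels(students$BirthOrder)       # the labels of the categories
as.integer(students$BirthOrder)   # the internal integer codes behind the labels

# Siblings answers "how many", so arithmetic is appropriate
mean(students$Siblings)
```

Once BirthOrder is a factor, R treats it as a set of categories, which guards against accidentally averaging it later.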

6 Conclusion

Let us take a look at Wickham and Grolemund’s Data Science workflow picture:

Figure 1: Data Science Workflow

So there we have it:

  • Data: We generate data by experiment, or obtain readily available data. We import and clean the data.
  • Variables: Questions lead us to identify Types of Variables (Quant and Qual).
  • Transform: Sometimes we may need to transform the data (long to wide, summarize, create new variables…).
  • Explore: Further Questions lead us to infer relationships between variables and the relative sizes of things, which we describe using Data Visualizations.
  • Report: What we find may be of interest or, best of all, outright surprising! We finally Communicate it with charts and descriptions in a research report.
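
The three transformations named in the Transform step can be sketched as follows. The sketch deliberately uses base R only, so that it runs without extra packages (the tidyverse offers its own verbs for the same jobs), and the data frame `scores_long` with its values is invented for illustration.

```r
# Hypothetical long-format data: one row per student per test
scores_long <- data.frame(
  student = rep(c("A", "B", "C"), each = 2),
  test    = rep(c("midterm", "final"), times = 3),
  score   = c(62, 70, 85, 81, 74, 90)
)

# Long to wide: one row per student, one column per test
scores_wide <- reshape(
  scores_long,
  idvar = "student", timevar = "test",
  direction = "wide"
)

# Create a new variable from existing ones
scores_wide$gain <- scores_wide$score.final - scores_wide$score.midterm

# Summarize: mean score for each test
test_means <- aggregate(score ~ test, data = scores_long, FUN = mean)

scores_wide
test_means
```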

You might think of all these Questions, Answers, and Mappings as a grammar, a language in itself. And indeed, in R we use a philosophy called the Grammar of Graphics! We will use this grammar in the R graphics packages that we encounter when we make Graphs next. Other parts of the Workflow (Transformation, Analysis, and Modelling) follow similar grammars, as we shall see.

7 AI Generated Summary and Podcast

This is a tutorial on data visualization using the R programming language. It introduces concepts such as data types, variables, and visualization techniques. The tutorial utilizes metaphors to explain these concepts, emphasizing the use of geometric aesthetics to represent data. It also highlights the importance of both visual and analytic approaches in understanding data. The tutorial then demonstrates basic chart types, including histograms, scatterplots, and bar charts, and discusses the “Grammar of Graphics” philosophy that guides data visualization in R. The text concludes with a workflow diagram for data science, emphasizing the iterative process of data import, cleaning, transformation, visualization, hypothesis generation, analysis, and communication.


7.1 Randomized Trials

These are the gold standard for experimental data. Subjects are randomly assigned to treatment and control groups (e.g., vaccine and no vaccine) to measure the effect of a treatment or intervention. Random assignment reduces selection bias and tends to balance confounding variables across the groups, which provides strong evidence for causal relationships.
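
Random assignment itself takes only a line or two in R. The sketch below assumes twenty hypothetical subjects and an equal split between the two groups. The seed value is arbitrary and is set only so that the example is reproducible.

```r
set.seed(42)  # arbitrary seed, for reproducibility of this sketch only

# Twenty hypothetical subjects, identified by number
subjects <- data.frame(id = 1:20)

# Shuffle ten "treatment" and ten "control" labels across the subjects
subjects$group <- sample(rep(c("treatment", "control"), each = 10))

table(subjects$group)  # group sizes after assignment
head(subjects)
```

Because chance alone decides who lands in which group, characteristics of the subjects (measured or not) tend to be spread evenly across the two groups.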


8 References

  1. Martyn Shuttleworth, Lyndsay T Wilson (Jun 26, 2009). What is the Scientific Method? Retrieved Mar 12, 2024 from Explorable.com: https://explorable.com/what-is-the-scientific-method
  2. Adam E.M. Eltorai, Jeffrey A. Bakal, Paige C. Newell, Adena J. Osband (editors). (March 22, 2023). Translational Surgery: Handbook for Designing and Conducting Clinical and Translational Research. A very lucid and easily explained set of chapters. (I have a copy. Yes.)
    • Part III. Clinical: fundamentals
    • Part IV: Statistical principles
  3. https://safetyculture.com/topics/design-of-experiments/
  4. Emi Tanaka. https://emitanaka.org/teaching/monash-wcd/2020/week09-DoE.html
  5. Open Intro Stats: Types of Variables
  6. Lock, Lock, Lock, Lock, and Lock. Statistics: Unlocking the Power of Data, Third Edition, Wiley, 2021. https://www.wiley.com/en-br/Statistics:+Unlocking+the+Power+of+Data,+3rd+Edition-p-9781119674160
  7. Claus Wilke. Fundamentals of Data Visualization. https://clauswilke.com/dataviz/
  8. Tim C. Hesterberg (2015). What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum, The American Statistician, 69:4, 371-386, DOI:10.1080/00031305.2015.1089789.

8.1 R Package Citations

| Package | Version | Citation |
|---|---|---|
| ggformula | 0.12.2 | Kaplan and Pruim (2025) |
| Lock5Data | 3.0.0 | Lock (2021) |
| mosaic | 1.9.2 | Pruim, Kaplan, and Horton (2017) |
| TeachingDemos | 2.13 | Snow (2024) |

Kaplan, Daniel, and Randall Pruim. 2025. ggformula: Formula Interface to the Grammar of Graphics. https://doi.org/10.32614/CRAN.package.ggformula.
Lock, Robin. 2021. Lock5Data: Datasets for “Statistics: UnLocking the Power of Data”. https://doi.org/10.32614/CRAN.package.Lock5Data.
Pruim, Randall, Daniel T Kaplan, and Nicholas J Horton. 2017. “The Mosaic Package: Helping Students to ‘Think with Data’ Using R.” The R Journal 9 (1): 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.
Snow, Greg. 2024. TeachingDemos: Demonstrations for Teaching and Learning. https://doi.org/10.32614/CRAN.package.TeachingDemos.

Footnotes

  1. How might this data have been obtained? By asking people in a survey and getting Yes/No answers!

  2. Qualitative variables are called Factor variables in R, and are stored, internally, as integer codes together with their levels. The codes take the values 1, 2, and so on.

Citation

BibTeX citation:
@online{v.2021,
  author = {V., Arvind},
  title = {Data},
  date = {2021-11-01},
  url = {https://madhatterguide.netlify.app/content/courses/Analytics/10-Descriptive/Modules/05-NatureData/},
  langid = {en}
}
For attribution, please cite this work as:
V., Arvind. 2021. “Data.” November 1, 2021. https://madhatterguide.netlify.app/content/courses/Analytics/10-Descriptive/Modules/05-NatureData/.

License: CC BY-SA 2.0
